[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2021-01-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263699#comment-17263699
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

steeveb972 commented on pull request #285:
URL: https://github.com/apache/nutch/pull/285#issuecomment-759002533


   Hi, it has been a while ;-) I will have a look at it this weekend.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Priority: Major
> Fix For: 1.18
>
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content.
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
> content via DOMContentUtils.getText(), there is no need to duplicate the
> title in the content. When the title is included in the content, it becomes
> difficult or impossible to extract the document body without it. This matters
> when a user wants to index or display the body and the title separately.
> Attached is a patch that prevents the HTML parser plugin from including the
> title in the document content.
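
The behaviour requested above can be illustrated with a minimal sketch. It is not the attached patch and not Nutch's DOMContentUtils: it assumes well-formed XHTML input parsed with the JDK's own XML parser, and the class and method names (ExcludeElementsSketch, collectText, textOf) are hypothetical. It only shows the idea of walking the DOM and skipping the subtree of any excluded element, such as the title.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical illustration only; not part of Nutch.
class ExcludeElementsSketch {

  // Appends the text found beneath 'node' to 'sb', skipping the whole
  // subtree of any element whose name is in 'excluded'.
  static void collectText(StringBuilder sb, Node node, Set<String> excluded) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && excluded.contains(node.getNodeName())) {
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
    }
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      collectText(sb, child, excluded);
    }
  }

  // Parses a well-formed XHTML string and returns its text content.
  static String textOf(String xhtml, Set<String> excluded) {
    try {
      Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
          .getDocumentElement();
      StringBuilder sb = new StringBuilder();
      collectText(sb, root, excluded);
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalStateException("could not parse input", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><head><title>My Title</title></head>"
        + "<body><p>Body text</p></body></html>";
    Set<String> excluded = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    excluded.add("title");
    System.out.println(textOf(page, excluded));        // prints: Body text
    System.out.println(textOf(page, new TreeSet<>())); // prints: My Title Body text
  }
}
```

With an empty exclusion set the title text is still part of the content, which is the duplication the issue describes.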



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2021-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262029#comment-17262029
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

lewismc commented on pull request #285:
URL: https://github.com/apache/nutch/pull/285#issuecomment-756976004


   Hi @steeveb972, this seems like a really useful patch. Are you able to update
it so we can try to get it into master?




[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2021-01-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261567#comment-17261567
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

lewismc commented on pull request #285:
URL: https://github.com/apache/nutch/pull/285#issuecomment-756976004


   Hi @steeveb972, this seems like a really useful patch. Are you able to update
it so we can try to get it into master?




[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread Jorge Luis Betancourt Gonzalez (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922388#comment-16922388
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1749:
---

Do we want to put this into the upcoming release? I've added some comments to
the PR, but will take a closer look at a later time.



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922386#comment-16922386
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

jorgelbg commented on pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285#discussion_r320669571
 
 

 ##
 File path: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
 ##
 @@ -116,36 +114,50 @@ public void setConf(Configuration conf) {
    * 
    * @return true if nested anchors were found
    */
-  public boolean getText(StringBuffer sb, Node node,
-  boolean abortOnNestedAnchors) {
-if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
+  private boolean getText(StringBuffer sb, Node node,
+  boolean abortOnNestedAnchors, Set excludedElementNames) {
 
 Review comment:
   Formatting
 



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922384#comment-16922384
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

jorgelbg commented on pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285#discussion_r320669382
 
 

 ##
 File path: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
 ##
 @@ -102,10 +99,11 @@ public void setConf(Configuration conf) {
   }
 
   /**
-   * This method takes a {@link StringBuffer} and a DOM {@link Node}, and will
+   * This method takes a {@link StringBuffer}, a DOM {@link Node}
+   * and an excluded element {@link Set}, and will
    * append all the content text found beneath the DOM node to the
-   * StringBuffer.
-   * 
+   * StringBuffer without the mentioned element names in the Set.
 
 Review comment:
   We return the textual content, not the element names. It should be something like:
   > without the text from the excluded elements
 



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922387#comment-16922387
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

jorgelbg commented on pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285#discussion_r320699825
 
 

 ##
 File path: src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
 ##
 @@ -107,50 +103,65 @@ public void setConf(Configuration conf) {
   }
 
   /**
-   * This method takes a {@link StringBuffer} and a DOM {@link Node}, and will
+   * This method takes a {@link StringBuffer}, a DOM {@link Node}
+   * and an excluded element {@link Set}, and will
    * append all the content text found beneath the DOM node to the
-   * StringBuffer.
-   * 
+   * StringBuffer without the mentioned element names in the Set.
 
 Review comment:
   Same as the previous comment.
 



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922385#comment-16922385
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

jorgelbg commented on pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285#discussion_r320699366
 
 

 ##
 File path: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
 ##
 @@ -116,36 +114,50 @@ public void setConf(Configuration conf) {
    * 
    * @return true if nested anchors were found
    */
-  public boolean getText(StringBuffer sb, Node node,
-  boolean abortOnNestedAnchors) {
-if (getTextHelper(sb, node, abortOnNestedAnchors, 0)) {
+  private boolean getText(StringBuffer sb, Node node,
+  boolean abortOnNestedAnchors, Set excludedElementNames) {
+if (getTextHelper(sb, node, abortOnNestedAnchors, 0, excludedElementNames)) {
   return true;
 }
 return false;
   }
 
   /**
    * This is a convinience method, equivalent to
-   * {@link #getText(StringBuffer,Node,boolean) getText(sb, node, false)}.
+   * {@link #getText(StringBuffer,Node,boolean, Set) getText(sb, node, false, excludedElementNames)}.
    * 
    */
-  public void getText(StringBuffer sb, Node node) {
-getText(sb, node, false);
+  public void getText(StringBuffer sb, Node node, Set excludedElementNames) {
+getText(sb, node, false, excludedElementNames);
   }
 
   // returns true if abortOnNestedAnchors is true and we find nested
   // anchors
   private boolean getTextHelper(StringBuffer sb, Node node,
-  boolean abortOnNestedAnchors, int anchorDepth) {
+boolean abortOnNestedAnchors, int anchorDepth, Set excludedElementNames) {
 boolean abort = false;
 NodeWalker walker = new NodeWalker(node);
+Set lcExcludedElementNames = new HashSet<>();
 
 Review comment:
   We should avoid duplicating the exclusion set. This method is executed many 
times. We could use a `TreeSet` when it is invoked, delegating the comparison 
to the `Set` implementation.
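
   A minimal sketch of this suggestion, assuming the exclusion set holds element names as strings. The class and method names (`ExclusionSetSketch`, `caseInsensitiveSet`) are hypothetical and not taken from the PR. A `TreeSet` built once with `String.CASE_INSENSITIVE_ORDER` makes `contains` ignore case, so the helper would not need to build a lowercased copy on every call:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not code from the pull request.
class ExclusionSetSketch {

  // Builds the exclusion set once, for example where the parser is configured.
  // The comparator makes lookups case-insensitive.
  static Set<String> caseInsensitiveSet(String... names) {
    Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    set.addAll(Arrays.asList(names));
    return set;
  }

  public static void main(String[] args) {
    Set<String> excluded = caseInsensitiveSet("title", "Script");
    System.out.println(excluded.contains("TITLE")); // prints: true
    System.out.println(excluded.contains("body"));  // prints: false
  }
}
```

   The trade-off is that `TreeSet` lookups are O(log n) rather than the O(1) of a `HashSet`, which should be negligible for a handful of element names.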
 



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922383#comment-16922383
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

jorgelbg commented on pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285#discussion_r320663834
 
 

 ##
 File path: build.xml
 ##
 @@ -1002,7 +1002,7 @@
 
   
   
-http://downloads.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2";
+http://freefr.dl.sourceforge.net/project/ant-eclipse/ant-eclipse/1.0/ant-eclipse-1.0.bin.tar.bz2";
 
 Review comment:
   I don't see a good reason to change this URL.
 



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2018-07-02 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530026#comment-16530026
 ] 

Sebastian Nagel commented on NUTCH-1749:


PR needs fixes to be merged, moving to 1.16



[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2018-02-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361536#comment-16361536
 ] 

ASF GitHub Bot commented on NUTCH-1749:
---

steeveb972 opened a new pull request #285: fix for NUTCH-1749 contributed by steeveb972
URL: https://github.com/apache/nutch/pull/285
 
 
   Hello,
   
   Following the description and the comments in this Jira issue, I propose a
patch that adds the ability to exclude some tags in the HTML and Tika parsers.
   
   Regards,
   
   Steeve

