[jira] [Updated] (NUTCH-1749) Optionally exclude title from content field

2019-10-01 Thread Sebastian Nagel (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1749:
---
Fix Version/s: 1.17 (was: 1.16)

> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Priority: Major
> Fix For: 1.17
>
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content.
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
> content via DOMContentUtils.getText(), there is no need to duplicate the
> title in the content. When the title is included in the content, it becomes
> difficult or impossible to extract the document body without the title. This
> matters when a user wants to index or display the body and title separately.
> Attached is a patch that prevents the HTML parser plugin from including the
> title in the document content.
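
For illustration only, the idea in the description can be sketched with the JDK's own DOM API. This is not the attached DOMContentUtils.patch and not Nutch's actual implementation; the class name TitleExclusionSketch, the methods collectText and extract, and the skipTitle flag are hypothetical names chosen for this example. It assumes well-formed XHTML input so that the standard XML parser can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical sketch: collect a page's text, optionally skipping <title>.
class TitleExclusionSketch {

  /** Appends all text below {@code node} to {@code out}, separated by spaces. */
  static void collectText(Node node, boolean skipTitle, StringBuilder out) {
    // When requested, ignore the whole <title> subtree.
    if (skipTitle
        && node.getNodeType() == Node.ELEMENT_NODE
        && "title".equalsIgnoreCase(node.getNodeName())) {
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (out.length() > 0) {
          out.append(' ');
        }
        out.append(text);
      }
    }
    for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
      collectText(child, skipTitle, out);
    }
  }

  /** Parses well-formed XHTML and returns its text content. */
  static String extract(String xhtml, boolean skipTitle) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    StringBuilder out = new StringBuilder();
    collectText(doc.getDocumentElement(), skipTitle, out);
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    String page = "<html><head><title>My Title</title></head>"
        + "<body><p>Body text.</p></body></html>";
    System.out.println(extract(page, false)); // title and body together
    System.out.println(extract(page, true));  // body only
  }
}
```

Making the exclusion a flag rather than a hard-coded change keeps the previous behaviour available, which matches the "optionally" in the issue summary.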



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (NUTCH-1749) Optionally exclude title from content field

2018-07-02 Thread Sebastian Nagel (JIRA)


[ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1749:
---
Fix Version/s: 1.16 (was: 1.15)

> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Priority: Major
> Fix For: 1.16
>
> Attachments: DOMContentUtils.patch





[jira] [Updated] (NUTCH-1749) Optionally exclude title from content field

2017-12-17 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1749:
---
Fix Version/s: 1.15 (was: 1.14)

> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Fix For: 1.15
>
> Attachments: DOMContentUtils.patch





[jira] [Updated] (NUTCH-1749) Optionally exclude title from content field

2016-09-13 Thread Sebastian Nagel (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1749:
---
Summary: Optionally exclude title from content field (was: Title duplicated in document body)

> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Fix For: 1.13
>
> Attachments: DOMContentUtils.patch


