[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

2010-11-22 Thread Markus Jelsma (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:


Patch Info: [Patch Available]

 LanguageIdentifier should not set empty lang field on NutchDocument
 ---

 Key: NUTCH-936
 URL: https://issues.apache.org/jira/browse/NUTCH-936
 Project: Nutch
 Issue Type: Bug
 Components: indexer
 Affects Versions: 1.2
 Reporter: Markus Jelsma
 Assignee: Markus Jelsma
 Priority: Minor
 Fix For: 1.3, 2.0

 Attachments: NUTCH-936-v12-1.patch, NUTCH-936-v13-1.patch, 
 NUTCH-936-v13-1.patch


 For some reason the language identifier plugin sometimes sets an empty value 
 for the lang field. This is confirmed to occur in 1.2 when parsing a scanned 
 PDF file that cannot be OCR'd to proper text, which results in an empty 
 content field. Whether or not this is a problem with the parser, the plugin 
 itself should not add an empty value, because any document can end up with 
 an empty content field. The plugin already checks for a null value and then 
 sets the lang field to `unknown`, which is fine. But when the lang string is 
 empty, it should also be set to `unknown`.
 This might break clients that have conditional logic on the empty value but 
 not on the `unknown` value: `unknown` may never have occurred in their setup, 
 so they might not have added it to their logic. However, it seems like 
 overkill to put this change behind a configuration option and let Nutch by 
 default continue to behave as it currently does. Any thoughts on this one?
 Here's the troublesome URL, which returns an empty content field and an 
 empty lang string in 1.2 and presumably in trunk and other versions as well:
 http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf
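
 The proposed behavior can be sketched as follows. This is only an 
 illustration of the null-or-empty check described above, not the content of 
 the attached patches: the class name `LangFieldSketch`, the method 
 `normalizeLang` and the constant `UNKNOWN` are hypothetical, and the sketch 
 deliberately does not touch any Nutch API.

```java
// Hypothetical helper, not part of Nutch: shows the proposed handling of the
// detected language before it would be written to the lang field.
final class LangFieldSketch {

    // Fallback value the plugin already uses when the language is null.
    static final String UNKNOWN = "unknown";

    // Returns the detected language, or "unknown" when it is null or empty.
    static String normalizeLang(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    public static void main(String[] args) {
        System.out.println(normalizeLang(null)); // null language
        System.out.println(normalizeLang(""));   // empty language, e.g. empty content
        System.out.println(normalizeLang("nl")); // detected language passes through
    }
}
```

 With a check like this, clients would only ever see a real language code or 
 `unknown`, never an empty string.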

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument

2010-11-22 Thread Markus Jelsma (JIRA)

 [ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:


Attachment: NUTCH-936-v13-1.patch
            NUTCH-936-v13-1.patch
            NUTCH-936-v12-1.patch

Here are patches for the current 1.2 stable release, the 1.3 branch, and trunk.
