[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument
[ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:

Patch Info: [Patch Available]

LanguageIdentifier should not set empty lang field on NutchDocument
---
Key: NUTCH-936
URL: https://issues.apache.org/jira/browse/NUTCH-936
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.2
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.3, 2.0
Attachments: NUTCH-936-v12-1.patch, NUTCH-936-v13-1.patch, NUTCH-936-v13-1.patch

For some reason the language identifier plugin sometimes sets an empty value for the lang field. This is confirmed to occur in 1.2 when parsing a scanned PDF file that cannot be OCR'd to proper text, which results in an empty content field. Whether or not this is a problem with the parser, the plugin itself should not add an empty value, because the content field can always be empty.

The plugin already checks for a null value and then sets the lang field to `unknown`, which is fine. But when the lang string is empty, it should also be set to `unknown`.

This might break clients that have conditional logic on the empty value but not on the `unknown` value, because `unknown` may never have occurred in their setup and they may therefore not have added it to their logic. However, it seems a bit overkill to put this proposal behind a configuration option and let Nutch continue to behave by default as it currently does. Any thoughts on this one?

Here's the troublesome URL: http://www.nrc.nl/redactie/binnenland/memo_buza_irak.pdf. It returns an empty content field and an empty lang string in 1.2, and presumably in trunk and other versions as well.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-936) LanguageIdentifier should not set empty lang field on NutchDocument
[ https://issues.apache.org/jira/browse/NUTCH-936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-936:

Attachment: NUTCH-936-v13-1.patch
            NUTCH-936-v13-1.patch
            NUTCH-936-v12-1.patch

Here are patches for the current 1.2 stable, branch 1.3 and trunk.
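
The attached patches are not reproduced in this thread. As a rough illustration of the behaviour the description asks for, here is a minimal sketch in plain Java. It is not the actual patch: the class name `LangFieldSketch`, the method `normalizeLang` and the use of a plain `Map` in place of a NutchDocument are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not the code from the attached patches.
class LangFieldSketch {

    // Fallback value named in the issue description.
    static final String UNKNOWN = "unknown";

    // Treat an empty lang string the same way as null: both become "unknown".
    static String normalizeLang(String lang) {
        if (lang == null || lang.isEmpty()) {
            return UNKNOWN;
        }
        return lang;
    }

    public static void main(String[] args) {
        // A plain Map stands in for the indexed document (assumption).
        Map<String, String> doc = new HashMap<>();

        // Simulates the scanned-PDF case: no text, so the detected lang is empty.
        doc.put("lang", normalizeLang(""));

        System.out.println(doc.get("lang")); // prints "unknown"
    }
}
```

With a check along these lines, clients would see only `unknown` or a detected language code in the lang field and never an empty string, which is the compatibility trade-off raised in the description.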