[jira] Updated: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Doğacan Güney updated NUTCH-406:

    Attachment: NUTCH-406.patch

A simple patch that writes nulls as empty strings.

Metadata tries to write null values
-----------------------------------

             Key: NUTCH-406
             URL: http://issues.apache.org/jira/browse/NUTCH-406
         Project: Nutch
      Issue Type: Bug
Affects Versions: 0.9.0
        Reporter: Doğacan Güney
     Attachments: NUTCH-406.patch

During parsing, some urls (especially pdfs, it seems) may create some_key, null pairs in ParseData's parseMeta. When Metadata.write() tries to write such a pair, it causes an NPE.

Stack trace will be something like this:

    at org.apache.hadoop.io.Text.encode(Text.java:373)
    at org.apache.hadoop.io.Text.encode(Text.java:354)
    at org.apache.hadoop.io.Text.writeString(Text.java:394)
    at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)

I can consistently reproduce this using the following url:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
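The approach described above (writing nulls as empty strings) can be sketched roughly as follows. This is an illustration only, not the contents of NUTCH-406.patch: the class NullSafeMetadataSketch, its writeString helper, and the single-value-per-key map are hypothetical stand-ins for Nutch's Metadata and Hadoop's Text.writeString, chosen so the example runs without Nutch or Hadoop on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container; this is NOT the Nutch class.
class NullSafeMetadataSketch {

    // Length-prefixed string writer. Like the writer in the reported stack
    // trace, it throws a NullPointerException when handed a null string.
    static void writeString(DataOutput out, String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // NPE if s == null
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Serializes every key/value pair, substituting "" for a null value so
    // that writeString never sees a null.
    static byte[] write(Map<String, String> meta) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(meta.size());
        for (Map.Entry<String, String> entry : meta.entrySet()) {
            writeString(out, entry.getKey());
            String value = entry.getValue();
            writeString(out, value == null ? "" : value);
        }
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("some_key", null); // the kind of pair described in the report
        byte[] bytes = write(meta);
        System.out.println("serialized " + bytes.length + " bytes without an NPE");
    }
}
```

One consequence of this choice is that a reader of the serialized form can no longer tell a null value from a genuinely empty one; whether that matters depends on how the metadata is consumed.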
[jira] Updated: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Chris A. Mattmann updated NUTCH-406:

    Assignee: Chris A. Mattmann

Metadata tries to write null values
-----------------------------------

             Key: NUTCH-406
             URL: http://issues.apache.org/jira/browse/NUTCH-406
         Project: Nutch
      Issue Type: Bug
Affects Versions: 0.9.0
        Reporter: Doğacan Güney
     Assigned To: Chris A. Mattmann
     Attachments: NUTCH-406.patch
[jira] Updated: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]

Doğacan Güney updated NUTCH-406:

    Attachment: NUTCH-406.patch

How about something like this then?

Metadata tries to write null values
-----------------------------------

             Key: NUTCH-406
             URL: http://issues.apache.org/jira/browse/NUTCH-406
         Project: Nutch
      Issue Type: Bug
Affects Versions: 0.9.0
        Reporter: Doğacan Güney
     Assigned To: Chris A. Mattmann
     Attachments: NUTCH-406.patch, NUTCH-406.patch