[jira] Commented: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452270 ]

Andrzej Bialecki commented on NUTCH-406:
----------------------------------------

Null value is not equivalent to an empty String - perhaps we should simply skip such values.

Metadata tries to write null values
-----------------------------------

Key: NUTCH-406
URL: http://issues.apache.org/jira/browse/NUTCH-406
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Doğacan Güney
Assigned To: Chris A. Mattmann
Attachments: NUTCH-406.patch

During parsing, some urls (especially pdfs, it seems) may create some_key, null pairs in ParseData's parseMeta. When Metadata.write() tries to write such a pair, it causes an NPE. Stack trace will be something like this:

    at org.apache.hadoop.io.Text.encode(Text.java:373)
    at org.apache.hadoop.io.Text.encode(Text.java:354)
    at org.apache.hadoop.io.Text.writeString(Text.java:394)
    at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)

I can consistently reproduce this using the following url:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452275 ]

Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------

Hi Andrzej, Doğacan,

+1. I think it makes a lot of sense to just not include the null key in the Met container.

Doğacan, in the future, when you attach a new version of a patch for a JIRA issue, please indicate the change by renaming the patch. Not a big deal, but good style points ;)

I'll commit this patch shortly.

Cheers,
Chris

Attachments: NUTCH-406.patch, NUTCH-406.patch
[jira] Commented: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452282 ]

Andrzej Bialecki commented on NUTCH-406:
----------------------------------------

Erhm, -1 from me. This code checks only if the first value is null, and then discards all other values (which may be non-null), thus we could lose valuable data if only the first value happens to be null ...

I think we should indeed check if the first value is null, but if it is, then loop over all other values, count the non-nulls, and if the count > 0, write out the key and the set of non-null values.
[jira] Commented: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452285 ]

Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------

Hi Doğacan,

Looking at your latest patch, I'm not sure that it completely does the right behavior. For example, what happens if there are 3 met values for a key k, and one of them is null, but the other 2 are not? Specifically, what if the first value is null, but the other 2 are not? In that case, your patch would skip over writing all of the keys. Wouldn't it just be easier to do something like this?

Index: src/java/org/apache/nutch/metadata/Metadata.java
===================================================================
--- src/java/org/apache/nutch/metadata/Metadata.java    (revision 478613)
+++ src/java/org/apache/nutch/metadata/Metadata.java    (working copy)
@@ -211,7 +211,9 @@
       values = getValues(names[i]);
       out.writeInt(values.length);
       for (int j = 0; j < values.length; j++) {
-        Text.writeString(out, values[j]);
+        if(values[j] != null && !values[j].equals("")){
+          Text.writeString(out, values[j]);
+        }
       }
     }
   }
[jira] Commented: (NUTCH-406) Metadata tries to write null values
[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452286 ]

Chris A. Mattmann commented on NUTCH-406:
-----------------------------------------

Hi Andrzej,

Yup, you caught the same thing as me. +1 for your solution. I will extend my above patch by writing getNumNonNullValues(values) instead of values.length.

Cheers,
Chris
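
For readers following the thread, the approach agreed on above (count the non-null values, write that count instead of values.length, and skip a key with no non-null values) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the committed patch: the body of getNumNonNullValues is a guess at the helper named in the thread, writeEntry and roundTrip are hypothetical names, and java.io's DataOutput.writeUTF stands in for Hadoop's Text.writeString so the example runs without Nutch or Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NonNullValuesSketch {

  // Helper named in the thread; this implementation is an assumption.
  static int getNumNonNullValues(String[] values) {
    int count = 0;
    for (String v : values) {
      if (v != null) {
        count++;
      }
    }
    return count;
  }

  // Hypothetical: writes one key with its non-null values.
  // A key whose values are all null is skipped entirely.
  static void writeEntry(DataOutput out, String name, String[] values) throws IOException {
    int nonNull = getNumNonNullValues(values);
    if (nonNull == 0) {
      return;
    }
    out.writeUTF(name);    // stand-in for Text.writeString(out, name)
    out.writeInt(nonNull); // the count must match the values actually written
    for (String v : values) {
      if (v != null) {
        out.writeUTF(v);   // stand-in for Text.writeString(out, v)
      }
    }
  }

  // Hypothetical: writes an entry, reads it back, and returns the values read.
  static List<String> roundTrip(String name, String[] values) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      writeEntry(new DataOutputStream(bytes), name, values);
      List<String> result = new ArrayList<>();
      if (bytes.size() == 0) {
        return result; // key was skipped
      }
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
      in.readUTF(); // key
      int n = in.readInt();
      for (int i = 0; i < n; i++) {
        result.add(in.readUTF());
      }
      return result;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("some_key", new String[] {null, "a", "b"})); // prints [a, b]
    System.out.println(roundTrip("some_key", new String[] {null}));           // prints []
  }
}
```

Writing the non-null count rather than values.length keeps the reader aligned with what was actually written. If the surrounding format also records the number of keys up front, that number would presumably need the same adjustment whenever a key is skipped.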