Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Doug Cutting (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ]

My apologies for commenting here - JIRA produces broken HTML for me, I can't use it...

> Doug Cutting commented on NUTCH-139:
>
> I think we're near agreement here. Here are the changes I think this patch still needs:
>
> MetadataNames belongs in the protocol package, not util.

Erhm.. please bear with me. I'd rather see these two classes in a separate package altogether, org.apache.nutch.metadata. The reason is that most likely these two classes will be used elsewhere too, not just in the protocol and parse/fetch related context. I'm specifically referring to the CrawlData.

> We should rename ContentProperties to Metadata.

+1.

> We should add an add() method to Metadata, and change set() to replace all values rather than add a new value. Protocol code which creates properties from headers should then use add().

+1

> We could commit after simply moving MetadataNames to protocol, and leave the changes to ContentProperties for another commit, but I'd prefer it all be done together.

Either way is fine with me. Perhaps splitting this into two commits would make it easier to fix potential breakage...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
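For readers following the add()/set() point above, here is a minimal sketch of the semantics being agreed on: add() appends a value, set() replaces every value stored under the name. The class name MetadataSketch and the get()/getValues() helpers are assumptions made for illustration; this is not the code from the patch and not the actual Nutch Metadata class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and helpers are not taken from the patch.
final class MetadataSketch {
    // Each property name maps to the list of values stored under it.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values stored under the name with the single given value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Returns the first value stored under the name, or null if there is none. */
    String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns a read-only view of all values stored under the name (possibly empty). */
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(list);
    }
}
```

Under this sketch, protocol code that copies headers would call add() once per header line, so a header that appears several times keeps all of its values, while a later set() on the same name leaves exactly one value.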
Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Andrzej Bialecki wrote:
> Erhm.. please bear with me. I'd rather see these two classes in a separate package altogether, org.apache.nutch.metadata. The reason is that most likely these two classes will be used elsewhere too, not just in the protocol and parse/fetch related context. I'm specifically referring to the CrawlData.

+1

Doug
RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Guys,

My apologies for the spamming comments -- I tried to submit my comment through JIRA one time and it kept giving me "service unavailable". So I resubmitted about 5 times, and on the fifth time it finally went through -- but I guess the other comments went through too. I'll try and remove them right away. Sorry again.

Cheers,
Chris

Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
Jet Propulsion Laboratory  Pasadena, CA
Office: 171-266B  Mailstop: 171-246

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

-----Original Message-----
From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 8:04 PM
To: nutch-dev@incubator.apache.org
Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361922 ]

Doug Cutting commented on NUTCH-139:

One more thing. Content length should also not need to be stored in the metadata as an x-nutch value. The content length is simply the length of the Content's data. The protocol may have truncated the content, in which case perhaps we need an x-nutch-truncated-content metadata property or something, but we should not be overwriting the HTTP Content-Length header, nor should we trust that it reflects the length of the data actually fetched.

Standard metadata property names in the ParseData metadata
--
Key: NUTCH-139
URL: http://issues.apache.org/jira/browse/NUTCH-139
Project: Nutch
Type: Improvement
Components: fetcher
Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt

Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem (the case of identifying that CONTENT_TYPE and conTeNT_TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType?

I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc.

The properties would be defined at the top of the ParseData class, something like:

    public class ParseData {
        ...
        public static final String CONTENT_TYPE = "content-type";
        public static final String CREATOR = "creator";
    }

In this fashion, users could at least know the names of the standard properties that they can obtain from the ParseData, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml");

Of course, this wouldn't preclude users from doing what they are currently doing; it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what they are named.

I'll contribute a patch near the end of this week, or the beginning of next week, that addresses this issue.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
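As a rough illustration of Doug's content-length comment quoted in the message above, the sketch below derives the length from the data actually held, leaves the Content-Length header exactly as received, and records truncation under a separate property. The class FetchedContentSketch, its constructor, and the maxLength parameter are all hypothetical; the property name x-nutch-truncated-content is only the one floated in the comment, not a name confirmed by any patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Nutch Content class.
final class FetchedContentSketch {
    // Name floated in the quoted comment; assumed here, not confirmed by the patch.
    static final String TRUNCATED = "x-nutch-truncated-content";

    private final byte[] data;
    private final Map<String, String> metadata = new HashMap<>();

    FetchedContentSketch(byte[] fetched, int maxLength, Map<String, String> headers) {
        // Headers are stored as received; Content-Length is never overwritten.
        metadata.putAll(headers);
        if (fetched.length > maxLength) {
            data = Arrays.copyOf(fetched, maxLength);
            // Record the truncation separately instead of touching Content-Length.
            metadata.put(TRUNCATED, "true");
        } else {
            data = fetched.clone();
        }
    }

    /** The content length is simply the length of the data actually held. */
    int getContentLength() {
        return data.length;
    }

    String getMetadata(String name) {
        return metadata.get(name);
    }
}
```

With this shape, a caller that needs the real size asks for getContentLength(), and the header value stays available for anyone who wants to compare what the server claimed with what was fetched.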