Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Andrzej Bialecki

Doug Cutting (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ]

  


My apologies for commenting here: JIRA produces broken HTML for me, so I 
can't use it...



> Doug Cutting commented on NUTCH-139:
>
> I think we're near agreement here.
>
> Here are the changes I think this patch still needs:
>
> MetadataNames belongs in the protocol package, not util.
  


Erhm.. please bear with me. I'd rather see these two classes in a 
separate package altogether, org.apache.nutch.metadata. The reason is 
that most likely these two classes will be used elsewhere too, not just 
in the protocol and parse/fetch related context. I'm specifically 
referring to the CrawlData.



> We should rename ContentProperties to Metadata.
  


+1.


> We should add an add() method to Metadata, and change set() to replace all
> values rather than add a new value.  Protocol code which creates properties
> from headers should then use add().
  


+1
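
To make the proposed add()/set() distinction concrete, here is a minimal sketch. It is not the actual Nutch class: only the names Metadata, add() and set() come from this thread, while get(), getValues() and the map-of-lists storage are assumptions made for illustration (and it uses present-day Java syntax).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued metadata container; not the actual Nutch class. */
final class Metadata {
    // Assumption: each name maps to an ordered list of values.
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value, keeping any values already stored under the name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values stored under the name with a single value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Returns the first value for the name, or null if there is none (assumed behavior). */
    String get(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    /** Returns every value stored under the name, in insertion order. */
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(list);
    }
}
```

With something like this, protocol code that copies a repeated header such as Set-Cookie would call add() once per header line, while code that wants to overwrite a property would call set().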


> We could commit after simply moving MetadataNames to protocol, and leave the
> changes to ContentProperties for another commit, but I'd prefer it all be done
> together.
  


Either way is fine with me. Perhaps splitting this into two commits 
would make it easier to fix potential breakage...


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Doug Cutting

Andrzej Bialecki wrote:
> Erhm.. please bear with me. I'd rather see these two classes in a
> separate package altogether, org.apache.nutch.metadata. The reason is
> that most likely these two classes will be used elsewhere too, not just
> in the protocol and parse/fetch related context. I'm specifically
> referring to the CrawlData.


+1

Doug


RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread chris.mattmann
Guys,

 My apologies for the spam comments. I tried to submit my comment through
JIRA once and it kept returning "service unavailable", so I resubmitted
about five times. The fifth attempt finally went through, but I guess the
other comments went through too. I'll try to remove them right away.

 Sorry again.

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory    Pasadena, CA
Office: 171-266B             Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


 -----Original Message-----
 From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED]
 Sent: Thursday, January 05, 2006 8:04 PM
 To: nutch-dev@incubator.apache.org
 Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in
 the ParseData metadata
 
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361922 ]
 
 Doug Cutting commented on NUTCH-139:
 
 
 One more thing.  Content length should also not need to be stored in the
 metadata as an x-nutch value.  The content length is simply the length of
 the Content's data.  The protocol may have truncated the content, in which
 case perhaps we need an x-nutch-truncated-content metadata property or
 something, but we should not be overwriting the HTTP Content-Length
 header, nor should we trust that it reflects the length of the data
 actually fetched.
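
 A rough sketch of what this could look like, purely as an illustration: the reported length is derived from the stored bytes, the HTTP Content-Length header is left untouched, and truncation is recorded under a separate property. FetchedContent, its constructor and its accessors are hypothetical names, not the Nutch Content API, and the property name is only the placeholder suggested above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for fetched bytes plus protocol headers; not the actual Nutch Content class. */
final class FetchedContent {
    /** Property name floated in the comment above ("or something"); treat it as a placeholder. */
    static final String TRUNCATED = "x-nutch-truncated-content";

    private final byte[] data;
    private final Map<String, String> metadata = new LinkedHashMap<>();

    /** Keeps at most maxBytes of the body and copies the headers unchanged. */
    FetchedContent(byte[] body, Map<String, String> headers, int maxBytes) {
        metadata.putAll(headers);            // Content-Length is not overwritten
        if (body.length > maxBytes) {
            data = Arrays.copyOf(body, maxBytes);
            metadata.put(TRUNCATED, "true"); // record the truncation separately
        } else {
            data = body.clone();
        }
    }

    /** The content length is simply the length of the data actually held. */
    int getLength() {
        return data.length;
    }

    /** Returns the stored header or property value, or null if absent. */
    String getMetadata(String name) {
        return metadata.get(name);
    }
}
```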
 
 
  Standard metadata property names in the ParseData metadata
  ----------------------------------------------------------

           Key: NUTCH-139
           URL: http://issues.apache.org/jira/browse/NUTCH-139
       Project: Nutch
          Type: Improvement
    Components: fetcher
      Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
   Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM,
                although bug is independent of environment
      Reporter: Chris A. Mattmann
      Assignee: Chris A. Mattmann
      Priority: Minor
       Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
   Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
                NUTCH-139.jc.review.patch.txt
 
  Currently, people are free to name their string-based properties
  anything they want, so names such as Content-type, content-TyPe, and
  CONTENT_TYPE can all have the same meaning. Stefan G., I believe,
  proposed a solution in which all property names are converted to lower
  case, but this really only fixes half the problem (identifying that
  CONTENT_TYPE, conTeNT_TyPE, and all the other case permutations are
  really the same). What if I named it Content Type, or ContentType?
  I propose that a way to correct this would be to create a standard set
  of named Strings in the ParseData class that the protocol framework and
  the parsing framework could use to identify common properties such as
  Content-type, Creator, Language, etc.
  The properties would be defined at the top of the ParseData class,
  something like:

    public class ParseData {
      ...
      public static final String CONTENT_TYPE = "content-type";
      public static final String CREATOR = "creator";
      ...
    }

  In this fashion, users could at least know the names of the standard
  properties that they can obtain from the ParseData, for example by
  making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to
  get the content type, or a call to
  ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set
  it. Of course, this wouldn't preclude users from doing what they are
  currently doing; it would just provide a standard method of obtaining
  some of the more common, critical metadata without poring over the code
  base to figure out what they are named.
  I'll contribute a patch near the end of this week, or the beginning of
  next week, that addresses this issue.
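
  The point that lower-casing fixes only half the problem can be seen in a small sketch. LowercaseDemo and lowercased() are hypothetical names used only for this illustration, and the code uses present-day Java syntax. Spellings that differ only in case collapse to one key, while spellings that differ in separators remain distinct.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical demo class; illustrates the limits of lower-casing property names. */
final class LowercaseDemo {
    /** Lower-cases each name and returns the set of distinct results. */
    static Set<String> lowercased(String... names) {
        Set<String> out = new LinkedHashSet<>();
        for (String name : names) {
            out.add(name.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Differ only in case: these collapse to a single key.
        System.out.println(lowercased("Content-type", "content-TyPe", "CONTENT-TYPE"));
        // Differ in separators: lower-casing leaves four distinct keys.
        System.out.println(lowercased("Content-type", "CONTENT_TYPE", "Content Type", "ContentType"));
    }
}
```

  Shared constants avoid the second case because every caller refers to the same string.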
 
 --
 This message is automatically generated by JIRA.
 -
 If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
 -
 For more information on JIRA, see:
http://www.atlassian.com/software/jira


