[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-10 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Stefan Groschupf updated NUTCH-135:
---

Attachment: contentProperties_patch_WithContentProperties.txt

I forgot to add the ContentProperties class itself to version control... thanks, 
Jack! 

 http header meta data are case insensitive in the real world (e.g. 
 Content-Type or content-type)
 

          Key: NUTCH-135
          URL: http://issues.apache.org/jira/browse/NUTCH-135
      Project: Nutch
         Type: Bug
   Components: fetcher
     Versions: 0.7, 0.7.1
     Reporter: Stefan Groschupf
     Priority: Critical
      Fix For: 0.8-dev, 0.7.2-dev
  Attachments: contentProperties_patch.txt, 
               contentProperties_patch_WithContentProperties.txt

 As described in issue NUTCH-133, some web servers return HTTP header metadata 
 whose keys do not use the standard capitalization, so the keys must be treated 
 as case-insensitive.
 This has many negative side effects. For example, querying the content type 
 from the metadata returns null even when the web server did return a content 
 type, because the key does not use the standard capitalization (e.g. it is 
 lower case). This also affects the PDF parser, which queries the content 
 length, etc.
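
 The failure mode described above can be reproduced with a plain 
 java.util.Properties object, which compares keys case-sensitively. This is 
 only a minimal sketch: the class name HeaderLookupDemo and the helper lookup 
 are made up for illustration and are not part of Nutch.

```java
import java.util.Properties;

// Hypothetical demo class, not part of the Nutch code base.
class HeaderLookupDemo {
    // Stores one header under storedKey, then looks it up under lookupKey.
    static String lookup(String storedKey, String value, String lookupKey) {
        Properties metadata = new Properties();
        metadata.setProperty(storedKey, value);
        return metadata.getProperty(lookupKey);
    }

    public static void main(String[] args) {
        // Server sent "content-type", caller asks for "Content-Type".
        System.out.println(lookup("content-type", "application/pdf", "Content-Type")); // prints "null"
        // Same capitalization on both sides works as expected.
        System.out.println(lookup("Content-Type", "application/pdf", "Content-Type")); // prints "application/pdf"
    }
}
```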

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-10 Thread Jerome Charron (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Jerome Charron updated NUTCH-135:
-

Attachment: cached.jsp.patch

cached.jsp must be patched too.

 http header meta data are case insensitive in the real world (e.g. 
 Content-Type or content-type)
 

          Key: NUTCH-135
          URL: http://issues.apache.org/jira/browse/NUTCH-135
      Project: Nutch
         Type: Bug
   Components: fetcher
     Versions: 0.7, 0.7.1
     Reporter: Stefan Groschupf
     Priority: Critical
      Fix For: 0.8-dev, 0.7.2-dev
  Attachments: cached.jsp.patch, contentProperties_patch.txt, 
               contentProperties_patch_WithContentProperties.txt





[jira] Updated: (NUTCH-135) http header meta data are case insensitive in the real world (e.g. Content-Type or content-type)

2005-12-09 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-135?page=all ]

Stefan Groschupf updated NUTCH-135:
---

Attachment: contentProperties_patch.txt

As Doug suggested, here is a patch that uses a TreeMap with 
String.CASE_INSENSITIVE_ORDER to solve the problem of case-insensitive HTTP 
headers and, more generally, case-insensitive content metadata. 
In general I see two different ways to solve the problem. The first is to leave 
the API as it is and extend the Properties object, overriding its methods to use 
a TreeMap behind the scenes. This solution would also require copying data back 
and forth between the Properties object and the TreeMap several times, since the 
Nutch code uses a Properties object in the Content constructor. The other 
choice is to change the API of the Content object to clearly document 
that another object, with different behavior than the Properties object, 
is used. The downside of this solution is that it requires many small 
changes across the Nutch code base. 
However, I decided on the clean way, the second one, since I don't like code 
that does things behind the scenes that developers would not expect. So I 
introduced a tiny ContentProperties object and changed the Content constructor 
to use the ContentProperties object instead of the java.util.Properties object. 
The new ContentProperties has an API similar to the Properties class but uses 
case-insensitive keys. I changed all classes that use the Content object to use 
the new ContentProperties for object instantiation, and I also extended the 
Content test case to test that case-insensitive keys are now supported. 
Feel free to suggest constructive improvements, but please also let us get 
this done as soon as possible, since from my point of view this is a critical 
issue. All test cases pass on my box, but please double-check before committing.
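
For readers without the attachment at hand, the TreeMap idea described above 
could look roughly like the following. This is only a sketch under 
assumptions: the class name CaseInsensitiveProperties and its methods are 
hypothetical stand-ins, and the ContentProperties class in the attached patch 
may well differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the ContentProperties class mentioned above;
// the class in the attached patch may look different.
class CaseInsensitiveProperties {
    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats keys that
    // differ only in case as the same key.
    private final Map<String, String> map =
        new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);

    public void setProperty(String key, String value) {
        map.put(key, value);
    }

    public String getProperty(String key) {
        return map.get(key);
    }

    public int size() {
        return map.size();
    }

    public static void main(String[] args) {
        CaseInsensitiveProperties headers = new CaseInsensitiveProperties();
        headers.setProperty("content-type", "text/html");
        // Both lookups find the value stored under the lower-case key.
        System.out.println(headers.getProperty("Content-Type"));
        System.out.println(headers.getProperty("CONTENT-TYPE"));
    }
}
```

One consequence of this design is that setting "Content-Type" after 
"content-type" replaces the earlier value instead of adding a second entry.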

 http header meta data are case insensitive in the real world (e.g. 
 Content-Type or content-type)
 

          Key: NUTCH-135
          URL: http://issues.apache.org/jira/browse/NUTCH-135
      Project: Nutch
         Type: Bug
   Components: fetcher
     Versions: 0.7.1, 0.7
     Reporter: Stefan Groschupf
     Priority: Critical
      Fix For: 0.8-dev, 0.7.2-dev
  Attachments: contentProperties_patch.txt

