Hi,
 
Many of the URLs we crawl return headers that look like this:
 
Connection: close
Date: Thu, 21 Jun 2007 09:28:42 GMT
Accept-Ranges: bytes
ETag: "2c0c3-650-cc1eb800"
Server: Apache/2.0.40 (Red Hat Linux)
Content-Length: 1616
Content-Type: text/html; charset=ISO-8859-1
Last-Modified: Mon, 09 Apr 2007 13:13:04 GMT
Client-Date: Thu, 21 Jun 2007 07:42:10 GMT
Client-Peer: 202.141.129.22:80
Client-Response-Num: 1
 
Here the ctype variable is set to "text/html; charset=ISO-8859-1" in
HttpResponse.java (for both protocol-http and protocol-httpclient), and the
mimeType lookup in that file then fails. I am talking about this piece of
code:
 
      /*
       * Extract the content type from the response and then look for its
       * mimetype preferences specified in mime-type.xml
       */
      String ctype = headers.get(Response.CONTENT_TYPE);
      int downloadSize = 0;
      if (ctype != null
          && (mimeType = http.getMimeTypes().forName(ctype)) != null) {
 
For this header, ctype should really be just "text/html". Since it is
currently "text/html; charset=ISO-8859-1", the mimeType variable comes out
null. As a result, neither the content limit specified in mimetypes.xml nor
the http.content.limit setting is respected for these documents.
 
One solution is to check ctype, split it on ";" and take the first part to
look up the mimeType. Does anyone have other ideas?
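
A minimal sketch of that idea, for illustration only. The class name
ContentTypes and the method baseType are hypothetical (they are not existing
Nutch code), and I am assuming that everything after the first ";" in the
header value is a parameter such as charset:

```java
/** Hypothetical helper for illustration; not part of Nutch. */
final class ContentTypes {

    private ContentTypes() {
        // static utility, no instances
    }

    /**
     * Returns the media type with any parameters removed, e.g.
     * "text/html; charset=ISO-8859-1" becomes "text/html".
     * Returns null when the header is null or has no media type part.
     */
    static String baseType(String ctype) {
        if (ctype == null) {
            return null;
        }
        // Keep only the part before the first ';' (the parameter separator).
        int semi = ctype.indexOf(';');
        String base = (semi >= 0) ? ctype.substring(0, semi) : ctype;
        base = base.trim();
        return base.isEmpty() ? null : base;
    }
}
```

With something like this, the lookup in HttpResponse.java could be given
ContentTypes.baseType(ctype) instead of the raw ctype, with the null check
applied to the stripped value. I have not tested this against either protocol
plugin.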
 
-vishal.
