Hi,

Many of the URLs we crawl have headers that look like this:

    Connection: close
    Date: Thu, 21 Jun 2007 09:28:42 GMT
    Accept-Ranges: bytes
    ETag: "2c0c3-650-cc1eb800"
    Server: Apache/2.0.40 (Red Hat Linux)
    Content-Length: 1616
    Content-Type: text/html; charset=ISO-8859-1
    Last-Modified: Mon, 09 Apr 2007 13:13:04 GMT
    Client-Date: Thu, 21 Jun 2007 07:42:10 GMT
    Client-Peer: 202.141.129.22:80
    Client-Response-Num: 1

For these responses, the ctype variable in HttpResponse.java is set to "text/html; charset=ISO-8859-1" (in both protocol-http and protocol-httpclient), and the mimeType lookup that follows fails. I am talking about this piece of code:

    /*
     * Extract the content type from the response and then look for its
     * mimetype preferences specified in mime-type.xml
     */
    String ctype = headers.get(Response.CONTENT_TYPE);
    int downloadSize = 0;
    if (ctype != null && (mimeType = http.getMimeTypes().forName(ctype)) != null) {

The lookup should be done with just "text/html". Because ctype still carries the charset parameter, forName(ctype) finds no match and mimeType comes out null. As a result, neither the content limit specified in mimetypes.xml nor the http.content.limit setting is respected for these documents.

One solution is to split ctype on ";" and use only the first part to look up the mimeType.

Anyone got any other ideas?

-vishal.
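
P.S. A minimal sketch of the split-on-";" idea, written as a standalone helper so it can be tried outside Nutch. The names ContentTypeSketch and baseType are hypothetical and do not exist in HttpResponse.java; the assumption is that the returned value would be passed to forName(...) in place of the raw ctype.

```java
final class ContentTypeSketch {

    /**
     * Returns the media type portion of a Content-Type header value,
     * i.e. everything before the first ';', with surrounding whitespace
     * removed. Returns null if the header is missing or the media type
     * portion is empty.
     */
    static String baseType(String ctype) {
        if (ctype == null) {
            return null;
        }
        int semi = ctype.indexOf(';');
        // No ';' means there are no parameters to strip.
        String base = (semi >= 0 ? ctype.substring(0, semi) : ctype).trim();
        return base.isEmpty() ? null : base;
    }

    public static void main(String[] args) {
        // Header value from the example response above.
        System.out.println(baseType("text/html; charset=ISO-8859-1")); // text/html
        // A value without parameters passes through unchanged.
        System.out.println(baseType("text/html")); // text/html
    }
}
```

This only strips parameters; it does not validate the media type or change its case, so whether the stripped value matches an entry still depends on how the mime type registry compares names.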