Hello, I ran into a crawling problem: my page itself contains only a couple of lines of text, but a lot of images (a lot!). I only care about the text part (for Solr), so I left the default settings in regex-urlfilter.txt like this:
-\.(pdf|PDF|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

Shortly after the crawl starts, Nutch says:

    XYZ skipped. Content of size 286981 was truncated to 64904

Is Nutch fetching the images anyway and counting the full size against http.content.limit? If that is the case, how can I work around it (other than increasing the limit...)?

thanks,
--iM
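For reference, if raising the limit turns out to be the only workaround, my understanding is that it would be overridden in conf/nutch-site.xml roughly like this (just a sketch; http.content.limit is the property I believe applies here, and 262144 is only an example value, not a recommendation):

    <!-- conf/nutch-site.xml: override the per-document download cap.
         Default in nutch-default.xml is 65536 bytes; anything beyond
         this is truncated (or the page is skipped, as in the log above). -->
    <property>
      <name>http.content.limit</name>
      <value>262144</value>
      <description>Maximum content length to download, in bytes.</description>
    </property>

But I'd rather avoid this if the truncation is really being triggered by image bytes I don't want in the first place.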

