hello,

I ran into a crawling problem: my page itself contains only a couple
of lines of text, but a lot of images (a lot!).
I only care about the text part (for Solr), so I kept the default
settings in the /regex-urlfilter.txt/ file, which look like this:

-\.(pdf|PDF|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
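For what it's worth, the filter does seem to reject plain image URLs when I
test it by hand (a quick check, assuming the URLFilterChecker tool that
ships with Nutch 1.x; the -allCombined flag and the example URL are just my
guesses for illustration):

  # feed a URL on stdin; the checker prints it back prefixed with
  # '-' if the configured filters reject it, '+' if it would be fetched
  echo "http://example.com/images/photo.jpg" \
    | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined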

Shortly after the crawl starts, Nutch says:
/XYZ skipped. Content of size 286981 was truncated to 64904/

Is Nutch fetching the images anyway, and counting their full size against
the *http.content.limit*?

If this is the case, how can I work around it (other than increasing the
limit...)?
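
(For reference, this is the knob I'd rather not touch -- a sketch of the
override in conf/nutch-site.xml, assuming the standard *http.content.limit*
property; the value shown is just the documented default, not a
recommendation:)

  <property>
    <name>http.content.limit</name>
    <!-- maximum length, in bytes, of downloaded content; anything
         beyond this is truncated (the Nutch default is 65536) -->
    <value>65536</value>
  </property>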

thanks,
--iM


