Images and HTML content are identified by Nutch as different URLs. To get a feel for this, right-click an HTML page that contains images and view its source: you will see the HTML tags and the text, but not the image data itself.
The message you saw might be triggered by: (1) some non-textual file format that is not rejected by the regex rules, or (2) a very large HTML (more precisely, textual) page. You should inspect the URLs for which you get the message.

Thanks,
Tejas

On Fri, Apr 19, 2013 at 2:30 PM, imehesz <[email protected]> wrote:

> hello,
>
> I ran into this crawling problem: my page itself only contains a couple
> of lines of text, but a lot of images (a lot!).
> I only care about the text part (for Solr), so I left the default
> settings in the regex-urlfilter.txt file like this:
>
> -\.(pdf|PDF|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> Shortly after the crawling starts, Nutch says:
> XYZ skipped. Content of size 286981 was truncated to 64904
>
> Is Nutch fetching the images anyway, and counting the full size against
> the http.content.limit?
>
> If this is the case, how can I work around it (other than increasing the
> limit)?
>
> thanks,
> --iM
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Skipping-domain-because-of-large-size-tp4057491.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
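(If inspection shows that the truncated URLs really are large textual pages you want indexed in full, the fetch cutoff can be raised in conf/nutch-site.xml. A minimal sketch; http.content.limit defaults to 65536 bytes, which is why content of size 286981 gets truncated, and the value below is only an example:)

```xml
<!-- conf/nutch-site.xml: raise the per-document fetch limit.
     The default http.content.limit is 65536 bytes; content beyond
     the limit is truncated, producing the "truncated" message. -->
<property>
  <name>http.content.limit</name>
  <!-- Example value only: 1 MB. Setting -1 disables the limit entirely. -->
  <value>1048576</value>
</property>
```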

