I am currently attempting to dump the contents of a crawl into multiple WARC files using:

./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc

However, I get multiple occurrences of "URL skipped. Content of size X was truncated to Y". I have set both http.content.limit and file.content.limit to -1 in order to remove any limits, but I'm guessing neither applies to this situation.
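For reference, this is roughly how I have those two properties set in conf/nutch-site.xml (inside the usual configuration element; -1 is meant to disable the limit):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>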
Is there any way to remove that cap?

Thanks,
JJAM

