Hi suyash,
This issue can be addressed by commenting out every instance where the
WebPage [0] object is augmented within each job (and possibly within each
plugin).
An example would be as follows:
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/parse/ParseUtil.java#L358
I am currently attempting to dump the contents of a crawl into multiple
WARC files using:
./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc
However, I get multiple occurrences of:
URL skipped. Content of size X was truncated to Y.
I have set both