date:20170316

Re: How to configure Apache gora to take only ol as column family ?

2017-03-16 Thread lewis john mcgibbney

Hi suyash, This issue can be addressed by commenting out every instance where the WebPage [0] object is augmented within each job (and possibly each plugin). One example is https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/parse/ParseUtil.java#L358

Content truncated while using commoncrawldump

2017-03-16 Thread jjmendes

I am currently attempting to dump the contents of a crawl into multiple WARC files using

    ./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc

However, I get multiple occurrences of

    URL skipped. Content of size X was truncated to Y.

I have set both