Hi AL,

The content is being truncated at 5242760 bytes, but the zip file is 17027128 bytes, so the parser skips it. Increase http.content.limit, or disable the limit by setting it to -1:
https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L216-L224

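For reference, here is a minimal sketch of the override, assuming you put it inside the <configuration> element of conf/nutch-site.xml and want to remove the limit entirely. The 20000000 mentioned in the comment is only an illustrative figure chosen to exceed the 17027128-byte file from your log, not a recommended setting.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value disables truncation. Alternatively, use a byte
       count larger than the biggest file you expect to fetch,
       e.g. 20000000 (illustrative value). -->
  <value>-1</value>
</property>
```

Since the truncation happens when the content is fetched, you will most likely need to fetch the URL again after changing the setting; re-running the parse on the existing segment would still see the truncated content.
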
On Mon, May 9, 2016 at 2:32 AM, <[email protected]> wrote:
>
> From: A Laxmi <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Date: Fri, 6 May 2016 14:54:29 -0400
> Subject: Re: Nutch 1.x crawl Zip file URLs
>
> Hi Lewis,
>
> I tried what you suggested but still no change. Please see the log message
> below. I put the parse-zip under plugins directory and also edited
> nutch-site.xml to include parse-zip under plugin.includes. I highlighted
> the Parse log message below which I think might be the one that didn't go
> through.
>
> Please help!
>
> 2016-05-06 14:47:32,226 INFO fetcher.Fetcher - Fetcher: finished at 2016-05-06 14:47:32, elapsed: 00:00:27
> 2016-05-06 14:47:33,127 INFO parse.ParseSegment - ParseSegment: starting at 2016-05-06 14:47:33
> 2016-05-06 14:47:33,127 INFO parse.ParseSegment - ParseSegment: segment: crawl_dir/crawl_zip2-sd/segments/20160506144702
> 2016-05-06 14:47:33,497 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-05-06 14:47:34,366 INFO parse.ParseSegment - https://www.xyz.xyz/sites/production/files/2016/policyarchive.zip skipped. Content of size 17027128 was truncated to 5242760
> 2016-05-06 14:47:34,896 INFO parse.ParseSegment - ParseSegment: finished at 2016-05-06 14:47:34, elapsed: 00:00:01
> 2016-05-06 14:47:36,010 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: starting at 2016-05-06 14:47:36
> 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: db: crawl_dir/crawl_zip2-sd/crawldb
> 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: segments: [crawl_dir/crawl_zip2-sd/segments/20160506144702]
> 2016-05-06 14:47:36,042 INFO crawl.CrawlDb - CrawlDb update: additions allowed: true
> 2016-05-06 14:47:36,042 INFO crawl.CrawlDb -
>

