Re: Nutch 1.x crawl Zip file URLs

A Laxmi Fri, 06 May 2016 11:55:20 -0700

Hi Lewis,

I tried what you suggested, but nothing changed. Please see the log output
below. I put parse-zip under the plugins directory and also edited
nutch-site.xml to include parse-zip in plugin.includes. I highlighted the
ParseSegment log message below, which I think shows the step that didn't go
through.


Please help!

*2016-05-06 14:47:32,226 INFO  fetcher.Fetcher - Fetcher: finished at 2016-05-06 14:47:32, elapsed: 00:00:27
2016-05-06 14:47:33,127 INFO  parse.ParseSegment - ParseSegment: starting at 2016-05-06 14:47:33
2016-05-06 14:47:33,127 INFO  parse.ParseSegment - ParseSegment: segment: crawl_dir/crawl_zip2-sd/segments/20160506144702
2016-05-06 14:47:33,497 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-05-06 14:47:34,366 INFO  parse.ParseSegment - https://www.xyz.xyz/sites/production/files/2016/policyarchive.zip skipped. Content of size 17027128 was truncated to 5242760
2016-05-06 14:47:34,896 INFO  parse.ParseSegment - ParseSegment: finished at 2016-05-06 14:47:34, elapsed: 00:00:01
2016-05-06 14:47:36,010 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: starting at 2016-05-06 14:47:36
2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: db: crawl_dir/crawl_zip2-sd/crawldb
2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: segments: [crawl_dir/crawl_zip2-sd/segments/20160506144702]
2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - CrawlDb update: additions allowed: true
2016-05-06 14:47:36,042 INFO  crawl.CrawlDb - *
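
For reference, the ParseSegment line in the log reports that the zip was
fetched only partially (17027128 bytes cut down to 5242760) and then skipped by
the parser, so parse-zip would not have seen a complete archive. Truncation of
this kind is usually tied to the fetcher's content size limit
(http.content.limit in Nutch 1.x); that is an assumption here, since the
configuration is not shown. Below is a minimal Python sketch, assuming log
lines in the format shown above, that lists such skipped URLs with their sizes.
The function name find_truncated and the regular expression are illustrative
and not part of Nutch.

```python
import re

# Matches the "skipped ... truncated" message as it appears in the log above.
# An optional "<url>" duplicate (mail-client residue) after the URL is tolerated.
SKIPPED = re.compile(
    r"parse\.ParseSegment - (?P<url>\S+)(?:\s+<\S+>)?\s+skipped\. "
    r"Content of size (?P<size>\d+) was truncated to (?P<kept>\d+)"
)


def find_truncated(log_text):
    """Return (url, full_size, kept_size) for each skipped, truncated document."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        (m.group("url"), int(m.group("size")), int(m.group("kept")))
        for m in SKIPPED.finditer(flat)
    ]


# Two lines copied from the log above.
sample = """\
2016-05-06 14:47:33,127 INFO  parse.ParseSegment - ParseSegment: starting at 2016-05-06 14:47:33
2016-05-06 14:47:34,366 INFO  parse.ParseSegment - https://www.xyz.xyz/sites/production/files/2016/policyarchive.zip skipped. Content of size 17027128 was truncated to 5242760
"""

for url, size, kept in find_truncated(sample):
    print(f"{url}: kept {kept} of {size} bytes")
```

Running this over hadoop.log after a crawl would show which documents were
dropped because of truncation rather than because of a missing parser plugin.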

Regards,
AL

On Thu, May 5, 2016 at 10:48 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi AL,
>
> Yes, please see the parse-zip plugin:
> https://github.com/apache/nutch/tree/master/src/plugin/parse-zip
> You can register this within the plugin.includes property in nutch-site.xml
> Thanks
>
> On Thu, May 5, 2016 at 7:00 PM, <[email protected]> wrote:
>
> > From: A Laxmi <[email protected]>
> > To: "[email protected]" <[email protected]>
> > Cc:
> > Date: Thu, 5 May 2016 21:59:34 -0400
> > Subject: Nutch 1.x crawl Zip file URLs
> > Hi,
> >
> > (a) Is it possible to crawl the URL of a Zip file using Nutch and index it in
> > Solr? (Please see the example below.)
> >
> > (b) Also, if the Zip file contains PDF files, is it possible to use Nutch to
> > crawl both the Zip file URL and the PDF files inside it?
> >
> >
> > E.g.
> > *https://www.abc123.xxx/sites/docs/testing.zip*
> > When I unzip the file at the above URL, I get the following:
> >
> > *def.pdf*
> > *lmn.pdf*
> > *reg.pdf*
> >
> >
> > Please advise.
> >
> > Thanks!
> >
> > AL
> >
> >
>
>
> --
> *Lewis*
>
