Re: Nutch 1.x crawl Zip file URLs

Lewis John Mcgibbney Thu, 05 May 2016 19:49:27 -0700

Hi AL,

Yes, please see the parse-zip plugin:
https://github.com/apache/nutch/tree/master/src/plugin/parse-zip
You can enable it by adding it to the plugin.includes property in nutch-site.xml.
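
As a rough sketch, the override in nutch-site.xml could look something like the
snippet below. Only the "zip" entry in the parse group is the point here; the
rest of the plugin list is an assumption for illustration, so keep whatever
your own configuration already lists and just add zip to it.

```xml
<!-- nutch-site.xml (illustrative sketch, not a drop-in config) -->
<!-- The value is a regular expression matched against plugin ids. -->
<!-- Every entry other than "zip" is an assumed placeholder: keep your own list. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|zip)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Because the value is a regex over plugin ids, parse-(html|tika|zip) is what
pulls in parse-zip alongside the parsers you already use.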
Thanks


On Thu, May 5, 2016 at 7:00 PM, <[email protected]> wrote:

> From: A Laxmi <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Date: Thu, 5 May 2016 21:59:34 -0400
> Subject: Nutch 1.x crawl Zip file URLs
> Hi,
>
> (a) Is it possible to crawl the URL of a Zip file using Nutch and index it
> in Solr? (Please see the example below.)
>
> (b) Also, if the Zip file contains PDF files, is it possible to use
> Nutch to crawl the Zip file URL and also the PDF files inside the Zip
> file?
>
>
> E.g.
> https://www.abc123.xxx/sites/docs/testing.zip
> When I unzip above URL - I would have the following:
>
>
> def.pdf
> lmn.pdf
> reg.pdf
>
>
> Please advise.
>
> Thanks!
>
> AL
>
>


-- 
*Lewis*
