Hi Markus,

So the following files should be configured:

= prefix-urlfilter.txt: make sure the file:// prefix is listed (mine
already has it).
= regex-urlfilter.txt: change the line -^(file|ftp|mailto): to
-^(ftp|mailto): so that file: URLs are no longer filtered out.
= urls/seed.txt: add the new URL/file path.
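To make it concrete, one more piece Markus mentioned is enabling the
protocol-file plugin. A rough sketch, assuming the usual plugin.includes
property in conf/nutch-site.xml (keep your existing value and just add
protocol-file; the "..." stands for the rest of your current list):

```
<!-- conf/nutch-site.xml: add protocol-file to the existing
     plugin.includes value -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|...</value>
</property>
```

With that in place, the updated regex-urlfilter.txt line would read
-^(ftp|mailto): and the new source would go into urls/seed.txt as e.g.
file:///data/intranet/docs/ (a hypothetical path, just for illustration).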

...and start crawling.
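For that last step, a rough sketch of the command (going from memory of
the 1.8 bin/crawl script, so please double-check its usage line; the
crawl directory and Solr URL are placeholders):

```
# inject the seeds and run two rounds, indexing into Solr
bin/crawl urls/ mycrawl http://localhost:8983/solr/ 2
```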

Is that enough? CMIIW (correct me if I'm wrong).

Thanks-



On Wed, Jun 4, 2014 at 7:33 PM, Markus Jelsma <[email protected]>
wrote:

> Hi Bayu,
>
>
> You must enable the protocol-file plugin first. Then make sure the
> file:// prefix is not filtered via prefix-urlfilter.txt or any other
> filter. Now just inject the new URLs and start the crawl.
>
>
> Cheers
>
>
>
> -----Original message-----
> From:Bayu Widyasanyata <[email protected]>
> Sent:Wed 04-06-2014 14:30
> Subject:Crawling web and intranet files into single crawldb
> To:[email protected];
> Hi,
>
> I am successfully running Nutch 1.8 and Solr 4.8.1 to fetch and index
> web sources (http protocol).
> Now I want to add file-share data sources (file protocol) to the
> current crawldb.
>
> What is the strategy or common practice for handling this situation?
>
> Thank you.-
>
> --
> wassalam,
> [bayu]
>



-- 
wassalam,
[bayu]
