Thanks, Bayu.

I have crawls/data from different sources (Wikipedia, Common Crawl,
etc.) in different formats. For Common Crawl, manipulating DNS seems
problematic: I do not have all URLs from a domain, so it would be too
complex and far too hackish. I am looking for a clean way to inject the
URLs into the fetcher, i.e. to map a subset of them.

Currently, I am thinking of keeping a Cassandra database of key-value
URL-content pairs and intervening directly where FetcherThread actually
fetches: if a URL is in my database (or whatever store), take the
content from there, otherwise do a normal HTTP fetch. But then the
politeness rules would have to be bypassed for the local hits, which
looks somewhat difficult, and I would have to change Nutch core
functionality.
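
Independent of Nutch internals, the lookup-or-fetch logic itself would
be small. A minimal plain-Java sketch, with a HashMap standing in for
the Cassandra table and all class and method names hypothetical:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    /** Serve a URL from a local store when present, else fall back to HTTP. */
    public class LookupOrFetch {

        // Hypothetical stand-in for the Cassandra url -> content table.
        private final Map<String, byte[]> localStore = new HashMap<>();

        public void put(String url, String html) {
            localStore.put(url, html.getBytes(StandardCharsets.UTF_8));
        }

        public byte[] fetch(String url) throws IOException {
            byte[] stored = localStore.get(url);
            if (stored != null) {
                // Local hit: no network I/O, so no politeness delay is needed.
                return stored;
            }
            // Miss: plain HTTP fetch (robots/politeness handling omitted here).
            HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes();  // Java 9+
            } finally {
                conn.disconnect();
            }
        }

        public static void main(String[] args) throws IOException {
            LookupOrFetch f = new LookupOrFetch();
            f.put("http://example.com/a.html", "<html>stored copy</html>");
            System.out.println(new String(f.fetch("http://example.com/a.html"),
                StandardCharsets.UTF_8));
        }
    }

The invasive part is not this logic but wiring it into FetcherThread
and suppressing the politeness delay for local hits, since waiting
between requests that never touch the network would only slow the
crawl down.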

Best regards,
Martin

-----Original Message-----
From: Bayu Widyasanyata <[email protected]>
Reply-to: [email protected]
To: [email protected]
Subject: Re: Nutch fetch local files with arbitrary mapped URLs
Date: Sun, 25 May 2014 21:45:48 +0700

Hi Martin,

Just put the files inside a common web server's "docroot" and serve
them from there.

If their URIs are fixed URLs, then you can create a local hostname with
local DNS support (not provided by Internet DNS).
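
For example, with a plain /etc/hosts entry (the hostname here is only
an illustration):

    127.0.0.1   crawl-mirror.local

then http://crawl-mirror.local/... resolves to your local web server
and the fetcher never leaves your machine.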

Hope it helps.
---
wassalam,
[bayu]

/sent from Android phone/
On May 24, 2014 7:16 PM, "Martin Aesch" <[email protected]> wrote:

> Hi all,
>
> I have a bunch of HTML files sitting in my file system. I know the
> http:// URL of each HTML file.
>
> If I just fetch from my file system, I will have file:// URLs, but I
> would like to map them to the http:// address or to any arbitrary
> address.
>
> Is there any halfway non-hackish possibility for doing that?
>
> Thanks,
> Martin
>
>
