Thanks, Bayu. I have crawls/data from different sources (Wikipedia, Common Crawl, etc.) in different formats. For Common Crawl, manipulating DNS seems problematic; additionally, I do not have all URLs from a domain, so that route would be too complex and far too hackish. I am looking for a clean way to inject the URLs into the fetcher, i.e. to map a subset of them.

Currently, I am thinking of keeping a Cassandra database with url->content key-value pairs and intervening directly where FetcherThread actually fetches: if I have the URL in my database, take the content from there, otherwise do a normal HTTP fetch. But then the politeness rules would have to be bypassed, which looks somewhat difficult, and I would have to change Nutch core functionality.
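To make the idea concrete, something like the following protocol plugin is roughly what I have in mind. This is only an untested sketch: it assumes the Nutch 1.x Protocol interface and the DataStax Java driver, the keyspace "crawl" and table pages(url text PRIMARY KEY, content blob) are made-up names, and the plugin wiring (plugin.xml, build files, how httpFallback gets instantiated) is omitted:

import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.ProtocolStatus;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import crawlercommons.robots.BaseRobotRules;

public class CassandraFirstProtocol implements Protocol {

  private Configuration conf;
  private Session session;
  private Protocol httpFallback; // e.g. protocol-http, wired up elsewhere

  public void setConf(Configuration conf) {
    this.conf = conf;
    // Hypothetical local Cassandra node and keyspace.
    this.session = Cluster.builder()
        .addContactPoint("127.0.0.1").build()
        .connect("crawl");
  }

  public Configuration getConf() {
    return conf;
  }

  public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
    Row row = session.execute(
        "SELECT content FROM pages WHERE url = ?", url.toString()).one();
    if (row == null) {
      // Not in the local store: do a normal HTTP fetch.
      return httpFallback.getProtocolOutput(url, datum);
    }
    // Found locally: hand the stored bytes to Nutch as if fetched.
    ByteBuffer buf = row.getBytes("content");
    byte[] bytes = new byte[buf.remaining()];
    buf.get(bytes);
    Content content = new Content(url.toString(), url.toString(),
        bytes, "text/html", new Metadata(), conf);
    return new ProtocolOutput(content, ProtocolStatus.STATUS_SUCCESS);
  }

  public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
    // Robots rules only matter for the remote fallback.
    return httpFallback.getRobotRules(url, datum);
  }
}

Note that this would not remove the politeness overhead: FetcherThread still applies its per-queue delays before calling any protocol, so local hits would be throttled like remote ones. That is exactly the part that would still seem to need a core change.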
Best regards,
Martin

-----Original Message-----
From: Bayu Widyasanyata <[email protected]>
Reply-to: [email protected]
To: [email protected]
Subject: Re: Nutch fetch local files with arbitrary mapped URLs
Date: Sun, 25 May 2014 21:45:48 +0700

Hi Martin,

Just put the files inside a common web server's "docroot" and serve them
from there. If their URIs are fixed URLs, then you can create a local
hostname with local DNS support (not provided by Internet DNS).

Hope it helps.

---
wassalam,

[bayu]

/sent from Android phone/

On May 24, 2014 7:16 PM, "Martin Aesch" <[email protected]> wrote:

> Hi all,
>
> I have a bunch of HTML files sitting in my file system. I know the
> http:// URL of each HTML file.
>
> If I just fetch from my file system, I will have file:// URLs, but I
> would like to map them to the http:// address or to any arbitrary
> address.
>
> Is there any halfway non-hackish possibility for doing that?
>
> Thanks,
> Martin

