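Hello,

Here's a workaround as a last resort: since you are going to use a Java program to process the dump anyway, you can add a few lines to it that remove every occurrence of the string "http://www.example.com/" from the dumped content. That turns href="http://www.example.com/dir/pag.html" into href="dir/pag.html".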
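A minimal sketch of that idea, assuming the dumped content fits in memory as a single string. The class name LinkLocalizer, the method stripBaseUrl, and the input/output file arguments are made up for illustration and are not part of Nutch:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LinkLocalizer {

    // Removes every literal occurrence of baseUrl from content.
    // String.replace does plain text matching, not regex matching,
    // so the dots in the URL need no escaping.
    static String stripBaseUrl(String content, String baseUrl) {
        return content.replace(baseUrl, "");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LinkLocalizer <input-file> <output-file>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);

        // Assumption: the dump is UTF-8 text.
        String content = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        String localized = stripBaseUrl(content, "http://www.example.com/");
        Files.write(out, localized.getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that the resulting links are relative to the site root, so they only resolve correctly for pages stored at the top level of the mirror; pages in subdirectories would need a "../" prefix per directory level.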
Best,
Emre

On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu <[email protected]> wrote:

> Hello,
>
> I am currently trying to use Nutch as a web site mirroring tool. To be more
> explicit, I only need to download the pages, not to index them (I do not
> intend to use it as a search engine). I couldn't figure a simpler way to
> accomplish my task, so what I do now is:
>
> - crawl the site, using the url;
> - merge the segments;
> - read segments (dump) and make it show the content.
>
> I didn't manage however to configure Nutch in order to change absolute
> links to local links (e.g. href="http://www.example.com/dir/pag.html" to be
> transformed in href="dir/pag.html"). I found URLNormalizer, but I don't
> understand what it does, if it only scans the crawled page url and
> transforms it, or it scans the content of the page being crawled, and
> modifies the href or src attributes.
>
> I would also want to know if you can configure Nutch to create a directory
> tree with all the pages it crawled. Now, I only have the dumped content
> which needs to be parsed by a Java program I am currently writing in order
> to create directory tree that matches the site's structure.
>
> Any help will be much appreciated! Thank you!
>

