Hello,

I am currently trying to use Nutch as a web site mirroring tool. To be more explicit, I only need to download the pages, not index them (I do not intend to use Nutch as a search engine). I couldn't figure out a simpler way to accomplish this, so what I do now is:

- crawl the site, starting from its URL;
- merge the resulting segments;
- read the merged segments (dump) and have it show the content.
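Concretely, the commands I run look roughly like this (I am on Nutch 1.x; "urls", "crawl" and "dump" are just my local directory names, and the depth/topN values are arbitrary):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
  bin/nutch mergesegs crawl/merged -dir crawl/segments
  bin/nutch readseg -dump crawl/merged/<merged segment> dump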
I didn't manage, however, to configure Nutch to rewrite absolute links into local ones (e.g. href="http://www.example.com/dir/pag.html" transformed into href="dir/pag.html"). I found URLNormalizer, but I don't understand what it does: does it only take the URL of each crawled page and transform that, or does it also scan the content of the page being crawled and modify the href and src attributes?

I would also like to know whether Nutch can be configured to create a directory tree with all the pages it crawled. Right now I only have the dumped content, which has to be parsed by a Java program I am currently writing in order to create a directory tree that matches the site's structure.
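For what it's worth, the mapping my Java program is supposed to do is roughly the following (just a sketch with placeholder names, ignoring query strings and other edge cases):

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MirrorWriter {

    // Map a crawled URL onto a path under the mirror root, e.g.
    // http://www.example.com/dir/pag.html -> <root>/www.example.com/dir/pag.html
    static Path localPathFor(String url, Path root) throws URISyntaxException {
        URI uri = new URI(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/index.html";              // bare host -> index.html
        } else if (path.endsWith("/")) {
            path = path + "index.html";        // directory URL -> index.html inside it
        }
        // drop the leading "/" so resolve() keeps everything under the root
        return root.resolve(uri.getHost()).resolve(path.substring(1));
    }

    // Write one page into its place in the tree, creating directories as needed.
    static void writePage(String url, String content, Path root)
            throws IOException, URISyntaxException {
        Path target = localPathFor(url, root);
        Files.createDirectories(target.getParent());
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // toy usage; in my case the URL/content pairs come from the readseg dump
        writePage("http://www.example.com/dir/pag.html",
                  "<html>...</html>", Paths.get("mirror"));
    }
}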
Any help will be much appreciated! Thank you!

Vlad
