Hi, do you have to use Nutch for this purpose? I believe you can use wget -m http://www.example.com and get everything in a much more structured way.
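If the link rewriting is the main concern, standard GNU wget options should cover it. A sketch (http://www.example.com is just the placeholder from your mail; adjust recursion depth and rate limiting to suit the real site):

    # -m   mirror: recursive download with timestamping and infinite depth
    # -k   convert links in the saved pages so they point at the local copies
    # -p   also fetch the images/CSS/JS each page needs to render
    # -E   append .html to pages served without an extension
    # -np  never ascend above the starting directory
    wget -m -k -p -E -np http://www.example.com/

The -k pass runs after the download finishes and rewrites absolute hrefs into relative local paths, which is the transformation you were trying to get out of Nutch.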
On 25 May 2012 11:07, vlad.paunescu <[email protected]> wrote:
> Hello,
>
> I am currently trying to use Nutch as a web site mirroring tool. To be more
> explicit, I only need to download the pages, not to index them (I do not
> intend to use it as a search engine). I couldn't figure out a simpler way to
> accomplish my task, so what I do now is:
>
> - crawl the site, using the url;
> - merge the segments;
> - read the segments (dump) and make them show the content.
>
> I didn't manage, however, to configure Nutch to change absolute links
> to local links (e.g. href="http://www.example.com/dir/pag.html" to be
> transformed into href="dir/pag.html"). I found URLNormalizer, but I don't
> understand what it does: whether it only scans the crawled page's URL and
> transforms it, or whether it scans the content of the page being crawled and
> modifies the href or src attributes.
>
> I would also like to know whether Nutch can be configured to create a directory
> tree with all the pages it crawled. Right now, I only have the dumped content,
> which needs to be parsed by a Java program I am currently writing in order
> to create a directory tree that matches the site's structure.
>
> Any help will be much appreciated! Thank you!
> Vlad
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-Nutch-for-Web-Site-Mirroring-tp3986067.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
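If you do stick with Nutch, the three steps you list map roughly onto the 1.x command line below. Treat it as a sketch only: the directory names, -depth and -topN values are placeholders, and the exact syntax varies between Nutch releases, so check it against your version.

    # 1. crawl, starting from the seed list in the urls/ directory
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

    # 2. merge the per-round segments into a single segment under crawl/merged
    bin/nutch mergesegs crawl/merged -dir crawl/segments

    # 3. dump the merged segment (including page content) as plain text
    bin/nutch readseg -dump crawl/merged/<merged-segment> dump_dir

The dump is still one flat text file, though, so as far as I know a post-processing step like the Java program you mention (or wget above) is still needed to turn it into a directory tree that mirrors the site.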

