Hello,

Thanks for the reply.

Is there a way to make Nutch dump the contents of a crawl (the fetch phase, up to the configured depth) onto the hard disk, not in the form of segments, but as a directory structure that mirrors the site's remote structure?
Also, is there a way to configure Nutch to take its seed URLs not from the directory on disk containing seed.txt, but to have them passed in (using the API, for instance)?

And I am curious: what happens if there are multiple domains in the seed.txt list, e.g. www.yahoo.com and www.msn.com? Will the contents of the fetch be kept separate for each domain, or will I get the dump in an interleaved fashion (one page from yahoo, two pages from msn, etc.)?

Thank you,

On Tue, Jun 12, 2012 at 5:26 PM, Emre Çelikten <[email protected]> wrote:

> Hello,
>
> Here's a workaround as a last resort: I think you can add simple code to
> remove all occurrences of the string "http://www.example.com/" from a dump
> if you are going to use a Java program anyway.
>
> Best,
>
> Emre
>
> On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu <[email protected]> wrote:
>
> > Hello,
> >
> > I am currently trying to use Nutch as a web site mirroring tool. To be more
> > explicit, I only need to download the pages, not to index them (I do not
> > intend to use it as a search engine). I couldn't figure a simpler way to
> > accomplish my task, so what I do now is:
> >
> > - crawl the site, using the url;
> > - merge the segments;
> > - read segments (dump) and make it show the content.
> >
> > I didn't manage however to configure Nutch in order to change absolute
> > links to local links (e.g. href="http://www.example.com/dir/pag.html" to be
> > transformed in href="dir/pag.html"). I found URLNormalizer, but I don't
> > understand what it does, if it only scans the crawled page url and
> > transforms it, or it scans the content of the page being crawled, and
> > modifies the href or src attributes.
> >
> > I would also want to know if you can configure Nutch to create a directory
> > tree with all the pages it crawled. Now, I only have the dumped content
> > which needs to be parsed by a Java program I am currently writing in order
> > to create directory tree that matches the site's structure.
> >
> > Any help will be much appreciated! Thank you!
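For reference, a minimal sketch of the post-processing discussed in this thread: mapping each crawled URL to a local file path (one top-level directory per host, so pages from several seed domains stay separate on disk), plus the string-replacement workaround for absolute links. This is plain Java and does not use any Nutch API; the class name MirrorPaths, both method names, and the "index.html" default for directory-style URLs are assumptions made for illustration. Query strings are ignored, and the link replacement only yields correct relative links for pages stored at the site root.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MirrorPaths {

    // Hypothetical helper: maps a page URL to a file path under `root`,
    // using the host name as the top-level directory.
    static Path toLocalPath(Path root, String url) {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        // Assumption: directory-style URLs are saved as index.html.
        if (path.endsWith("/")) {
            path = path + "index.html";
        }
        // Drop the leading '/' so the path resolves relative to the host directory.
        return root.resolve(uri.getHost()).resolve(path.substring(1));
    }

    // The workaround suggested in the thread: strip the site prefix so that
    // href="http://www.example.com/dir/pag.html" becomes href="dir/pag.html".
    static String stripBaseUrl(String html, String baseUrl) {
        return html.replace(baseUrl, "");
    }

    public static void main(String[] args) {
        Path root = Paths.get("mirror");
        System.out.println(toLocalPath(root, "http://www.example.com/dir/pag.html"));
        System.out.println(stripBaseUrl(
                "<a href=\"http://www.example.com/dir/pag.html\">page</a>",
                "http://www.example.com/"));
    }
}
```

The content of each dumped record would then be written to the path returned by toLocalPath, after creating the parent directories with Files.createDirectories.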

