Hi,

Do you have to use Nutch for this purpose? I believe you can use wget -m
http://www.example.com and get everything in a much more structured way.
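
For example, something along these lines should mirror the site and
rewrite the absolute links to local ones for offline browsing (standard
GNU wget options; adjust to your needs):

  # mirror the site, convert links so they work locally, and also
  # fetch page requisites (images, CSS) without leaving the host
  wget --mirror --convert-links --page-requisites --no-parent \
       http://www.example.com/

That also keeps the site's directory structure on disk, so you get the
directory tree you are after without any post-processing.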

On 25 May 2012 11:07, vlad.paunescu <[email protected]> wrote:

> Hello,
>
> I am currently trying to use Nutch as a web site mirroring tool. To be more
> explicit, I only need to download the pages, not to index them (I do not
> intend to use it as a search engine). I couldn't figure out a simpler way
> to accomplish my task, so what I do now is:
>
> - crawl the site, starting from its URL;
> - merge the resulting segments;
> - read the segments (dump) and output their content (roughly the
>   commands sketched below).
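>
> Concretely, this is roughly what I run (Nutch 1.x local runtime; the
> directory names below are just placeholders):
>
>   # crawl, starting from the seed list in urls/
>   bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
>
>   # merge all the fetched segments into a single one
>   bin/nutch mergesegs crawl/merged -dir crawl/segments
>
>   # dump the merged segment, keeping only the raw page content
>   bin/nutch readseg -dump crawl/merged/* dump -nofetch -nogenerate \
>       -noparse -noparsedata -noparsetext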
>
> However, I didn't manage to configure Nutch to rewrite absolute links
> into local ones (e.g. href="http://www.example.com/dir/pag.html" turned
> into href="dir/pag.html"). I found URLNormalizer, but I don't understand
> what it does: does it only normalize the URL of the page being crawled,
> or does it also scan the content of that page and modify its href and
> src attributes?
>
> I would also like to know whether Nutch can be configured to create a
> directory tree with all the pages it crawled. Right now I only have the
> dumped content, which has to be parsed by a Java program I am currently
> writing in order to build a directory tree that matches the site's
> structure.
>
> Any help will be much appreciated! Thank you!
> Vlad
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-Nutch-for-Web-Site-Mirroring-tp3986067.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
