Nutch as mirroring tool

Vlad Paunescu Mon, 18 Jun 2012 06:59:18 -0700

Hello,

I want to use Nutch for website mirroring, that is, to import a whole site
starting from a remote URL.


I have already managed to write a program that fetches the pages, merges the
segments, and reads the content of the segments.

What I want to do next is:

- create a local directory structure that mirrors the remote structure. Is
there an elegant way to do this with the existing Nutch API, or do I need
to build the structure manually from the segment content?
- convert the links inside every page to relative links. For example, if a
src attribute points to "http://www.mysite.com/resources/foo.txt", I need to
change it to "/resources/foo.txt" so that it points to the local file. Can I
use crawl_parse or parse_data to get the links? I am not sure how to do this
with the Nutch API.
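
To make the two steps concrete, here is a minimal sketch in plain Java that
does not touch the Nutch API at all. It assumes the URL of each page has
already been read from the segments. The class name MirrorSketch, the methods
toLocalPath and toRootRelative, and the index.html default for directory-style
URLs are hypothetical choices made only for illustration.

```java
import java.net.URI;
import java.nio.file.Path;

final class MirrorSketch {

    // Map a fetched URL to a file below the mirror root, for example
    // http://www.mysite.com/resources/foo.txt -> <root>/resources/foo.txt
    static Path toLocalPath(Path mirrorRoot, String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            // Assumption: directory-style URLs are stored as index.html.
            path = path + "index.html";
        }
        String relative = path.startsWith("/") ? path.substring(1) : path;
        Path root = mirrorRoot.normalize();
        Path local = root.resolve(relative).normalize();
        // Reject paths such as /../x that would escape the mirror root.
        if (!local.startsWith(root)) {
            throw new IllegalArgumentException("URL escapes mirror root: " + url);
        }
        return local;
    }

    // Rewrite an absolute link to a root-relative one when it points at the
    // mirrored host; every other link is returned unchanged.
    static String toRootRelative(String mirroredHost, String link) {
        URI uri;
        try {
            uri = URI.create(link);
        } catch (IllegalArgumentException e) {
            return link; // not a parsable URI, leave it alone
        }
        String host = uri.getHost();
        if (host == null || !host.equalsIgnoreCase(mirroredHost)) {
            return link; // relative, opaque (mailto:), or external link
        }
        String path = uri.getRawPath();
        StringBuilder out =
            new StringBuilder(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }
}
```

With these helpers, toRootRelative("www.mysite.com",
"http://www.mysite.com/resources/foo.txt") gives "/resources/foo.txt", and
toLocalPath gives the file under the mirror root where that resource would be
written. Creating the parent directories and writing the bytes is left out.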

Thank you,
Vlad
