Hello,

I am currently trying to use Nutch as a web site mirroring tool. To be more explicit, I only need to download the pages, not index them (I do not intend to use Nutch as a search engine). I couldn't figure out a simpler way to accomplish this, so what I do now is:

- crawl the site, starting from its URL;
- merge the resulting segments;
- read the merged segments (dump) and have it show the content.
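Concretely, the commands I run look roughly like this (I am on Nutch 1.x; "urls", "crawl" and "dump" are just my local directory names, and the depth/topN values are arbitrary):

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
  bin/nutch mergesegs crawl/merged -dir crawl/segments
  bin/nutch readseg -dump crawl/merged/<merged segment> dump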
I didn't manage, however, to configure Nutch to rewrite absolute links into local ones (e.g. href="http://www.example.com/dir/pag.html" transformed into href="dir/pag.html"). I found URLNormalizer, but I don't understand what it does: does it only take the URL of each crawled page and transform that, or does it also scan the content of the page being crawled and modify the href and src attributes?

I would also like to know whether Nutch can be configured to create a directory tree with all the pages it crawled. Right now I only have the dumped content, which has to be parsed by a Java program I am currently writing in order to create a directory tree that matches the site's structure.
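For what it's worth, the mapping my Java program is supposed to do is roughly the following (just a sketch with placeholder names, ignoring query strings and other edge cases):

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MirrorWriter {

    // Map a crawled URL onto a path under the mirror root, e.g.
    // http://www.example.com/dir/pag.html -> <root>/www.example.com/dir/pag.html
    static Path localPathFor(String url, Path root) throws URISyntaxException {
        URI uri = new URI(url);
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/index.html";              // bare host -> index.html
        } else if (path.endsWith("/")) {
            path = path + "index.html";        // directory URL -> index.html inside it
        }
        // drop the leading "/" so resolve() keeps everything under the root
        return root.resolve(uri.getHost()).resolve(path.substring(1));
    }

    // Write one page into its place in the tree, creating directories as needed.
    static void writePage(String url, String content, Path root)
            throws IOException, URISyntaxException {
        Path target = localPathFor(url, root);
        Files.createDirectories(target.getParent());
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // toy usage; in my case the URL/content pairs come from the readseg dump
        writePage("http://www.example.com/dir/pag.html",
                  "<html>...</html>", Paths.get("mirror"));
    }
}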
Any help will be much appreciated! Thank you!

Vlad
