Re: Nutch as a crawler

Emre Çelikten Tue, 12 Jun 2012 07:27:21 -0700

Hello,

Here's a workaround as a last resort: if you are going to post-process the
dump with a Java program anyway, you can add a few lines of code that remove
every occurrence of the string "http://www.example.com/" from it.
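
A minimal sketch of that idea, in case it helps. The class name, the method
name and the command-line arguments are made up for illustration, and I am
assuming the dump has been saved as a UTF-8 text file:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch.
class LinkRewriter {

    // Assumed site root; replace it with the root of the site you crawled.
    static final String BASE = "http://www.example.com/";

    /** Removes every literal occurrence of base from text. */
    static String stripBase(String text, String base) {
        // String.replace works on literal text (no regex) and replaces
        // all occurrences, not just the first one.
        return text.replace(base, "");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: LinkRewriter <input-file> <output-file>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[1]);

        // Assumption: the dumped content is UTF-8 text.
        String content = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        Files.write(out, stripBase(content, BASE).getBytes(StandardCharsets.UTF_8));
    }
}
```

Note that this leaves links relative to the site root, so they only resolve
correctly for pages stored in the top-level directory; pages in
subdirectories would still need a "../" prefix per directory level. It also
matches only the exact string, so variants such as "https://" or a host
without "www" are left untouched.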


Best,

Emre

On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu <[email protected]> wrote:

> Hello,
>
> I am currently trying to use Nutch as a web site mirroring tool. To be more
> explicit, I only need to download the pages, not to index them (I do not
> intend to use it as a search engine). I couldn't figure out a simpler way
> to accomplish my task, so what I do now is:
>
> - crawl the site, using the url;
> - merge the segments;
> - read segments (dump) and make it show the content.
>
> However, I didn't manage to configure Nutch to change absolute links to
> local links (e.g. href="http://www.example.com/dir/pag.html" to be
> transformed into href="dir/pag.html"). I found URLNormalizer, but I don't
> understand what it does: whether it only scans the URL of the crawled page
> and transforms it, or whether it scans the content of the page being
> crawled and modifies the href or src attributes.
>
> I would also like to know whether Nutch can be configured to create a
> directory tree with all the pages it crawled. Right now I only have the
> dumped content, which needs to be parsed by a Java program I am currently
> writing in order to create a directory tree that matches the site's
> structure.
>
> Any help will be much appreciated! Thank you!
>
