Re: Nutch as a crawler

Vlad Paunescu Fri, 15 Jun 2012 05:43:50 -0700

Hello,

Thanks for the reply. Is there a way to make Nutch dump the contents of a
crawl (the fetch phase, up to the configured depth) onto the hard disk, not
as segments, but as a directory tree that mirrors the site's remote
structure?
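
To make the idea concrete, here is a minimal sketch in plain Java of the
mapping I have in mind, from a page URL to a file under a local root. It does
not use any Nutch API: the class name MirrorPaths, the index.html fallback,
and the <root>/<host>/<path> layout are all assumptions chosen for
illustration, and query strings are simply ignored.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: maps page URLs to local files.
class MirrorPaths {

    // Maps a URL to <outputRoot>/<host>/<url path>.
    // An empty path or a path ending in "/" is stored as index.html (assumption).
    static Path toLocalPath(Path outputRoot, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path += "index.html";
        }
        Path hostRoot = outputRoot.resolve(host).normalize();
        // Drop the leading "/" so the URL path resolves relative to the host directory.
        Path target = hostRoot.resolve(path.substring(1)).normalize();
        // Reject paths such as "/../x" that would escape the host directory.
        if (!target.startsWith(hostRoot)) {
            throw new IllegalArgumentException("Path escapes mirror root: " + url);
        }
        return target;
    }

    // Writes the page content to its mirrored location, creating parent directories.
    static Path save(Path outputRoot, String url, byte[] content) throws IOException {
        Path target = toLocalPath(outputRoot, url);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }
}
```

With this layout, http://www.example.com/dir/pag.html would end up in
<root>/www.example.com/dir/pag.html, and each host gets its own
subdirectory.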


Also, is there a way to configure Nutch to take its seed URLs not from the
directory on disk containing seed.txt, but from values passed in directly
(using the API, for instance)? And I am curious: what happens if there are
multiple domains in the seed.txt list, e.g. www.yahoo.com and www.msn.com?
Will the contents of the fetch be separate for each domain, or will I get
the dumped contents in an interleaved fashion (one page from yahoo, two
pages from msn, etc.)?
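
In case the dump does turn out to be interleaved (I do not know whether it
is), the records could be regrouped by host afterwards. A minimal sketch in
plain Java, assuming the page URLs have already been extracted from the dump
as strings; HostGrouping and groupByHost are hypothetical names and no Nutch
API is involved:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: regroups interleaved URLs by host.
class HostGrouping {

    // Returns host -> URLs, keeping hosts in first-seen order and
    // URLs in their original order within each host.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // skip entries without a host, e.g. relative URLs
            }
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }
        return byHost;
    }
}
```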

Thank you,

On Tue, Jun 12, 2012 at 5:26 PM, Emre Çelikten <[email protected]> wrote:

> Hello,
>
> Here's a workaround as a last resort: I think you can add simple code to
> remove all occurrences of the string "http://www.example.com/" from a dump
> if you are going to use a Java program anyway.
>
> Best,
>
> Emre
>
> On Tue, Jun 12, 2012 at 5:01 PM, Vlad Paunescu <[email protected]> wrote:
>
> > Hello,
> >
> > I am currently trying to use Nutch as a web site mirroring tool. To be more
> > explicit, I only need to download the pages, not to index them (I do not
> > intend to use it as a search engine). I couldn't figure a simpler way to
> > accomplish my task, so what I do now is:
> >
> > - crawl the site, using the url;
> > - merge the segments;
> > - read segments (dump) and make it show the content.
> >
> > I didn't manage however to configure Nutch in order to change absolute
> > links to local links (e.g. href="http://www.example.com/dir/pag.html" to be
> > transformed in href="dir/pag.html"). I found URLNormalizer, but I don't
> > understand what it does, if it only scans the crawled page url and
> > transforms it, or it scans the content of the page being crawled, and
> > modifies the href or src attributes.
> >
> > I would also want to know if you can configure Nutch to create a directory
> > tree with all the pages it crawled. Now, I only have the dumped content
> > which needs to be parsed by a Java program I am currently writing in order
> > to create directory tree that matches the site's structure.
> >
> > Any help will be much appreciated! Thank you!
> >
>
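
For reference, the workaround quoted above (stripping the site prefix from
the dumped pages) could be sketched in plain Java as follows. LinkRewriter
and stripSitePrefix are hypothetical names; the sketch only touches
double-quoted href and src attributes, and the resulting links are relative
to the site root, so they are only correct as-is for pages stored at the top
of the mirror directory.

```java
// Hypothetical helper, not part of Nutch: rewrites absolute links in dumped HTML.
class LinkRewriter {

    // Removes sitePrefix (e.g. "http://www.example.com/") from the start of
    // double-quoted href and src attribute values. A link that is exactly the
    // prefix becomes an empty attribute value.
    static String stripSitePrefix(String html, String sitePrefix) {
        return html
                .replace("href=\"" + sitePrefix, "href=\"")
                .replace("src=\"" + sitePrefix, "src=\"");
    }
}
```

Pages in subdirectories would additionally need their links adjusted
relative to their own location (for example by prepending "../" per
directory level), which this sketch leaves out.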
