Re: Nutch as mirroring tool

Vlad Paunescu Fri, 22 Jun 2012 02:11:23 -0700

Hi Lewis,

Thank you very much for your quick reply.
I would like an expert opinion on whether Nutch is the appropriate tool
for what I want to accomplish. Opinions in my team are divided: some want
to use Nutch for this, while others recommend wget.

We have a web application in which users can import a site (the front end:
HTML, CSS, JS, and images) to use as a starting point for their own site.
Users provide a URL and we try to download all the assets starting from
that URL, with a limit of 10 MB per downloaded site.
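
The 10 MB limit could be enforced with a small running counter, whichever
tool ends up doing the fetching. A minimal sketch in Python (chosen only for
brevity; the importer need not be Python). `SiteBudget`, `try_add`, and
reading "10 MB" as 10 * 1024 * 1024 bytes are assumptions made for
illustration, not part of Nutch or of the importer described here:

```python
MAX_SITE_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 MiB


class SiteBudget:
    """Tracks how many bytes a single import has stored so far (hypothetical helper)."""

    def __init__(self, limit=MAX_SITE_BYTES):
        self.limit = limit
        self.used = 0

    def try_add(self, payload: bytes) -> bool:
        """Account for payload if it still fits; return False if it would exceed the limit."""
        if self.used + len(payload) > self.limit:
            return False
        self.used += len(payload)
        return True
```

The caller would stop storing assets for a site as soon as `try_add`
returns False.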
Currently, a custom-made importer does this, but it runs inside the web
server process, which is inefficient. We want an external application to
take care of the import tasks.

A new importer should be implemented outside of the web server, and because
a quick implementation could have bugs, we decided to use a ready-made tool.
Apache Nutch was the first choice that came to mind, and I have to figure
out whether Nutch is the appropriate tool for this task. What I've managed
to do so far is this:

- First, I crawl a site and merge the segments (I've created a custom
Import class, similar to the Crawl class, that additionally merges the
segments).

- Second, I can read the merged segment with another class that I've
created (the files are key-value pairs of some type). Thus, I get the
content of every page. By reading the parse_data from the segment, I can
get the outlinks for a parsed page, so I can search for absolute links
inside the content and replace them with relative paths. The problem is
that I can't keep the two reads synchronized (reading the content for a
URL, and reading the parse_data for the same URL). I would also like to
know whether parse_data includes the outlinks found inside CSS (CSS can
contain background URLs, for instance, and they can be absolute too).

- Third, and this isn't done yet, I need to create a directory structure
on disk into which I write every page that was crawled. As far as I know,
Nutch doesn't provide this by default. Am I right?

These are the steps that I currently have in mind.
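
For the third step, mapping each crawled URL to a file under a local root is
mostly path handling. A minimal sketch, again in Python and independent of
Nutch; `local_path_for`, `write_site`, and the choice of `index.html` for
directory-style URLs are assumptions made for illustration:

```python
import os
from urllib.parse import urlsplit, unquote


def local_path_for(url, root):
    """Map a page URL to a file path under root, mirroring the remote layout."""
    path = unquote(urlsplit(url).path)
    if not path or path.endswith("/"):
        path += "index.html"  # assumption: directory URLs are stored as index.html
    # Drop empty, "." and ".." segments so nothing can escape the root directory.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return os.path.join(root, *segments)


def write_site(pages, root):
    """Write every (URL -> bytes) entry of pages below root, creating directories as needed."""
    for url, body in pages.items():
        target = local_path_for(url, root)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as fh:
            fh.write(body)
```

The sketch ignores query strings and does not handle a URL that is both a
file and a directory prefix (for example `/a` and `/a/b`); those cases would
need a naming rule of their own.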

Another way of doing this is to have some way to be notified in our program
when a page is fetched by the fetcher, and to do something with it. I would
need to attach a listener to the fetcher, but I am not sure this can be
done without modifying the Nutch source code.

I also run Nutch in local mode, where Hadoop writes to my local disk. Do
you think the steps that write the file structure to disk should be
designed as a Hadoop map-reduce program, or is a plain sequential approach
better? I think I can't use a map-reduce job because the output is not raw
data in the form of key-value pairs, but a structured tree of directories
and files, like the one wget produces.

So, finally, do you think that Nutch is the appropriate tool for what we
need, or should I choose another tool?

Thank you,
Vlad

On Wed, Jun 20, 2012 at 11:09 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Vlad,
>
> On Mon, Jun 18, 2012 at 2:58 PM, Vlad Paunescu <[email protected]>
> wrote:
> > - create a local directory structure which resembles the remote
> > structure: is there any elegant way of using the existing Nutch API to
> > accomplish this, or I need to manually create the structure from the
> > segments content;
>
> As far as I know, Nutch doesn't currently have this
> escaping/transformation between absolute --> relative paths within the
> API. A suitable option would be to implement something of this nature
> within o.a.n.util.URLUtil... there are already some excellent methods
> in there to get you started with the kind of URL processing that Nutch
> offers. I think it would be excellent if this kind of mapping could be
> achieved and made configurable... if you get working on it then please
> open an issue if you can.
>
> > - convert links inside every page to relative links. For example, if a
> > src points to "http://www.mysite.com/resources/foo.txt" I need to
> > change that to be "/resources/foo.txt" because I want to point to the
> > local file. My question is if I can use the crawl_parse, or parse_data
> > to get the links. I am not sure how to do this, using the Nutch API.
>
> The Parse class allows you to access Parse.getOutlinks which would
> then enable you to process them if you could write the correct
> configuration as above.
>
> Let us know how you get on.
>
> hth
>
>
>
> --
> Lewis
>
