Re: Nutch as mirroring tool

Lewis John Mcgibbney Mon, 25 Jun 2012 13:43:00 -0700

Hi Vlad,

On Fri, Jun 22, 2012 at 10:10 AM, Vlad Paunescu <[email protected]> wrote:
> The problem is
> that I can't keep these synchronized (reading the content for a url, and
> reading the parse_data for the same url).


Why can't you keep this in sync? What have you tried so far? Why isn't
it working?

> I would also want to know, if
> parse_data processes the outlinks inside the css for example (because in
> css we can have background urls for instance, and they can be absolute too).

Do you have an example of this CSS?
Out of the box, Nutch doesn't process outlinks at the point where it
finds them. Instead it extracts them from fetched pages and stores them
for future fetching. Only when an actual URL (outlink) is fetched will
it be processed. In that case these URLs are treated exactly the same
as the ones you fetch and process above.
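
If it turns out that the parser you are using does not pick up url(...)
references inside CSS, one option is to extract them yourself and feed
them back in as URLs to fetch. The following is only a rough sketch of
that idea in plain Java. CssUrlExtractor is a hypothetical helper, not
part of Nutch, and the regular expression covers only the common
quoted and unquoted url(...) forms.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class): pulls url(...) targets out of CSS text.
final class CssUrlExtractor {
    // Matches url(...) with optional matching single or double quotes around the target.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*(['\"]?)([^'\")]+)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs; relative targets are resolved against the stylesheet's own URL.
    static List<String> extract(String stylesheetUrl, String css) {
        URI base = URI.create(stylesheetUrl);
        List<String> urls = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            if (target.startsWith("data:")) {
                continue; // inline data, nothing to fetch
            }
            try {
                urls.add(base.resolve(target).toString());
            } catch (IllegalArgumentException e) {
                // target is not a syntactically valid URI; skip it in this sketch
            }
        }
        return urls;
    }
}
```

Absolute URLs pass through unchanged and relative ones are resolved
against the stylesheet URL, so both cases you mention end up as
ordinary URLs that can be fetched like any other outlink.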

>
> - Third, and this isn't done yet, is to create a directory structure on the
> disk where I write every website page that was crawled. Nutch doesn't
> provide this by default, as far as I know. Am I right?

Nutch has a number of tools for obtaining specific data about URLs
from the crawldb, linkdb etc. Do you want to store a complete copy
of every page locally? If so, then no, Nutch doesn't do this.
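
If you do end up writing the pages out yourself, the core of it is a
mapping from a URL to a file under some mirror root. Below is a minimal
sketch of one possible mapping, in plain Java. MirrorPaths and its
methods are hypothetical names of mine, not Nutch code, and the layout
(root/host/path, with index.html for directory-style URLs) is just an
assumption about what you want.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Nutch class): maps a URL to a file below a mirror root.
final class MirrorPaths {
    // Layout assumed here: <root>/<host>/<url path>. Port and query string are ignored.
    static Path toLocalPath(Path root, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path = path + "index.html"; // directory-style URL gets a default file name
        }
        Path hostDir = root.resolve(host).normalize();
        Path target = hostDir.resolve(path.substring(1)).normalize();
        // Refuse paths such as /../x that would escape the host directory.
        if (!target.startsWith(hostDir)) {
            throw new IllegalArgumentException("Path escapes mirror root: " + url);
        }
        return target;
    }

    // Writes the raw page bytes to the mapped location, creating directories as needed.
    static Path write(Path root, String url, byte[] content) throws IOException {
        Path target = toLocalPath(root, url);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }
}
```

Where the url and content bytes come from (for example, from reading
the segment data after a fetch) is a separate question that this
sketch does not try to answer.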

> Another way of doing this is to have some way to be notified in our program
> when a page is fetched by the fetcher, and to do something with it. I would
> need to attach a listener to the fetcher, but I am not sure this can be
> done without modifying the Nutch source code.

A modification to a source distribution is most certainly required
here. You will need to work out exactly when a FetchItem (see
Fetcher.java) is queued and subsequently fetched. Once the fetch has
succeeded, that is the point at which you will want to be notified.
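
To give a rough idea of the shape such a change could take, here is a
plain-Java sketch of a listener mechanism. FetchListener and
FetchNotifier are hypothetical names and do not exist in Nutch. Where
exactly fireFetched would be called from inside the fetcher is
something you would have to determine from the source yourself.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical callback interface (not part of Nutch).
interface FetchListener {
    void onFetched(String url, byte[] content);
}

// Hypothetical registry that a modified fetcher could call after a successful fetch.
final class FetchNotifier {
    // CopyOnWriteArrayList lets several fetcher threads fire events while listeners are added.
    private final List<FetchListener> listeners = new CopyOnWriteArrayList<>();

    void addListener(FetchListener listener) {
        listeners.add(listener);
    }

    // Notifies every listener; one failing listener does not stop the others.
    void fireFetched(String url, byte[] content) {
        for (FetchListener listener : listeners) {
            try {
                listener.onFetched(url, content);
            } catch (RuntimeException e) {
                System.err.println("Listener failed for " + url + ": " + e);
            }
        }
    }
}
```

A listener registered this way could, for instance, write each page to
disk as it arrives rather than reading it back from the segments
afterwards.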

>
> I also run Nutch in the local mode, where hadoop is writing on my local
> disk. Do you think that the steps I need to implement in order to write the
> file structure on disk should be thought as a map- reduce Hadoop program,
> or a normal approach is better?

Initially I would say that this depends on the scale at which you wish
to run the application.
If you get it working without running over Hadoop then fine; if you
then need to scale it up, experiment with running it as MR (MapReduce).

> So, finally, do you think that Nutch is the appropriate tool for what we
> need, or I have to choose another tool?
>

I think that in order to achieve what you are attempting, there may be
a number of places within Nutch where you'll need to hack. I'm still
uncertain of the complete aim of the project you're working on, as it
seems like you're trying to use Nutch in a slightly different manner
than users usually do, which will most likely require some code
alterations.

hth
