Re: Nutch as mirroring tool

Vlad Paunescu Wed, 27 Jun 2012 05:46:14 -0700

Hi Lewis,

Thank you for your answer, and for your patience in responding thoroughly
to my questions! I am currently trying a more lightweight tool, crawler4j.
If I don't manage to get something useful out of it, I will return to Nutch.


Vlad

On Mon, Jun 25, 2012 at 11:42 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Vlad,
>
> On Fri, Jun 22, 2012 at 10:10 AM, Vlad Paunescu <[email protected]>
> wrote:
> > The problem is
> > that I can't keep these synchronized (reading the content for a URL, and
> > reading the parse_data for the same URL).
>
> Why can't you keep this in sync? What have you tried so far? Why isn't
> it working?
>
> > I would also want to know if
> > parse_data processes the outlinks inside the CSS, for example (because in
> > CSS we can have background URLs, for instance, and they can be absolute
> > too).
>
> Do you have an example of this CSS?
> Out of the box, Nutch doesn't process outlinks itself; instead it
> extracts them from fetched pages and stores them for future
> fetching. Only when an actual URL (outlink) is fetched will it be
> processed. In this case these URLs would be treated exactly the same as
> the ones you fetch and process above.
>
> >
> > - Third, and this isn't done yet, is to create a directory structure on
> > the disk where I write every website page that was crawled. Nutch
> > doesn't provide this by default, as far as I know. Am I right?
>
> Nutch has a number of tools for us to obtain specific data about urls
> from the crawldb, linkdb etc. Are you wanting to store a complete copy
> of every page locally? If this is the case then no, Nutch doesn't do
> this.
>
> > Another way of doing this is to have some way to be notified in our
> > program when a page is fetched by the fetcher, and to do something with
> > it. I would need to attach a listener to the fetcher, but I am not sure
> > this can be done without modifying the Nutch source code.
>
> A modification to a source distribution is most certainly required
> here... This will require you to ascertain exactly when a FetchItem
> (@see Fetcher.java) is queued and subsequently fetched. Once the fetch
> has succeeded, you will want to trigger your notification.
>
> >
> > I also run Nutch in local mode, where Hadoop is writing to my local
> > disk. Do you think that the steps I need to implement in order to write
> > the file structure on disk should be designed as a MapReduce Hadoop
> > program, or is a normal approach better?
>
> Initially I would say that this depends on the scale at which you wish
> to run the application.
> If you get it working w/o running over Hadoop then fine; if you then
> need to scale it up, then experiment with running MR...
>
> > So, finally, do you think that Nutch is the appropriate tool for what we
> > need, or do I have to choose another tool?
> >
>
> I think in order to achieve what you are trying to do, there may be a
> number of places within Nutch where you'll need to hack. I'm kinda still
> uncertain of the complete aim of the project which you're working on, as
> it seems like you're trying to use Nutch in a slightly different manner
> than users usually do, which will undoubtedly require some code
> alterations.
>
> hth
>
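
On the question quoted above about background URLs inside CSS: one way to
see which links a stylesheet carries is to pull the url(...) targets out
with a regular expression and resolve them against the stylesheet's own URL.
The sketch below is plain Java and does not use any Nutch API; the class
name CssUrlExtractor and the method extract are hypothetical names chosen
for illustration, and the regex is a simplification that does not handle
CSS escapes or @import rules written without url(...).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects url(...) targets from CSS text.
final class CssUrlExtractor {
    // Matches url(...) with an optional matching pair of quotes around the target.
    private static final Pattern CSS_URL = Pattern.compile(
        "url\\(\\s*(['\"]?)([^'\")]+?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

    // Returns absolute URLs for every url(...) found in css, resolved against baseUrl.
    static List<String> extract(String css, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> out = new ArrayList<>();
        Matcher m = CSS_URL.matcher(css);
        while (m.find()) {
            String target = m.group(2).trim();
            if (target.startsWith("data:")) {
                continue; // inline data, nothing to fetch
            }
            try {
                // Relative targets are resolved against the stylesheet URL;
                // absolute targets come back unchanged.
                out.add(base.resolve(target).toString());
            } catch (IllegalArgumentException e) {
                // Skip targets that are not valid URIs (e.g. unescaped spaces).
            }
        }
        return out;
    }
}
```

The resulting list could then be fed back into the crawl as extra seed URLs,
so that the referenced images are fetched like any other page.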

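For the directory structure on disk discussed in the thread, the core step
is mapping each crawled URL to a file path under a mirror root. Below is a
minimal sketch in plain Java, again independent of Nutch: MirrorPaths,
toLocalPath and write are hypothetical names, the query string is ignored,
and URLs ending in "/" are assumed to map to index.html. How the page bytes
are read out of a Nutch segment is left out on purpose.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Nutch: maps URLs to files under a mirror root.
final class MirrorPaths {
    // Maps e.g. http://example.com/docs/ to <root>/example.com/docs/index.html.
    static Path toLocalPath(Path root, String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (path.endsWith("/")) {
            path = path + "index.html"; // assumption: directory URLs get an index file
        }
        Path hostDir = root.resolve(host).normalize();
        // Drop the leading "/" so the URL path resolves relative to the host directory.
        Path target = hostDir.resolve(path.substring(1)).normalize();
        if (!target.startsWith(hostDir)) {
            // Guards against paths such as /a/../../x escaping the host directory.
            throw new IllegalArgumentException("Path escapes mirror root: " + url);
        }
        return target;
    }

    // Writes the fetched bytes to the mapped location, creating parent directories.
    static Path write(Path root, String url, byte[] content) throws IOException {
        Path target = toLocalPath(root, url);
        Files.createDirectories(target.getParent());
        return Files.write(target, content);
    }
}
```

Because the query string is dropped, two URLs that differ only in their
parameters would overwrite each other; a real mirror would need to encode
the query into the file name.
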