Re: nutch 2.x recrawl re-crawl

J. Gobel Tue, 15 Jan 2013 01:53:00 -0800

Hi,

I stumbled on http://wiki.apache.org/nutch/bin/nutch_inject

I can specify, per URL in my seeds.txt file, what the fetch interval and score
should be. This is great :-)
My seeds.txt now contains the following, tab-delimited:

http://www.latestnews.nl/    nutch.score=10    nutch.fetchInterval=60    userType=open_source

(what is userType?)
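
To make the layout of such a line concrete, here is a minimal Python sketch of how a tab-delimited seed line splits into a URL plus key=value pairs. It is not Nutch's injector code; the helper name parse_seed_line is hypothetical, and I am assuming that any key other than nutch.score and nutch.fetchInterval (such as userType) is simply carried along as custom metadata.

```python
def parse_seed_line(line):
    """Split one tab-delimited seed line into (url, metadata dict).

    Hypothetical helper for illustration only; not part of Nutch.
    """
    # Drop the trailing newline and ignore empty fields from repeated tabs.
    fields = [f.strip() for f in line.rstrip("\n").split("\t") if f.strip()]
    url, metadata = fields[0], {}
    for field in fields[1:]:
        # Each remaining field is expected to look like key=value.
        key, _, value = field.partition("=")
        metadata[key] = value
    return url, metadata


line = ("http://www.latestnews.nl/\tnutch.score=10"
        "\tnutch.fetchInterval=60\tuserType=open_source")
url, meta = parse_seed_line(line)
print(url)
for key, value in meta.items():
    # Assumption: keys without the "nutch." prefix are custom metadata.
    kind = "reserved" if key.startswith("nutch.") else "custom"
    print(f"{key} = {value} ({kind})")
```

All values come back as strings here; what the injector does with them is up to Nutch itself.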

Now I wonder how this plays out, as I am not completely familiar with how Nutch
fetches URLs. If I inject urls/seed.txt on every run, Nutch will fetch the URLs
from index.php. But will it also parse them in the same run, or will Nutch
parse previously fetched URLs? How does this work, and how can I influence
which URLs will be fetched?

In my humble opinion, this class should be mentioned in the recrawl wiki.

Thanks in advance,

Jaap

On Tue, Jan 15, 2013 at 1:43 AM, <[email protected]> wrote:

> I think there is no need for a new plugin or anything like that. If you
> know the list of news urls, you need to inject them each cycle in order to
> fetch them and their new inlinks, since when you inject a url its fetch
> time is set to the current time.
>
> Alex.
>
> -----Original Message-----
> From: J. Gobel <[email protected]>
> To: user <[email protected]>
> Sent: Mon, Jan 14, 2013 2:47 pm
> Subject: Re: nutch 2.x recrawl re-crawl
>
>
> I looked at the freegenerator tool, but I don't see how I can use that
> tool to crawl www.thelatestnews.com/index.php, scrape the newest links
> from that page, and fetch those.
>
> My dream plugin would allow me to feed Nutch priority urls such as
> 'www.thelatestnews.com/index.php', and it would fetch all the new links
> found on that url with a priority.
>
>
>
> On Mon, Jan 14, 2013 at 11:42 PM, Markus Jelsma
> <[email protected]> wrote:
>
> > freegenerator tool
