Hi,

I stumbled on http://wiki.apache.org/nutch/bin/nutch_inject

I can specify per URL in my seeds.txt file what the fetch time and score should be. This is great :-)

My seeds.txt now contains the following, tab-delimited:

http://www.latestnews.nl/ nutch.score=10 nutch.fetchInterval=60 userType=open_source

(What is userType?)

Now I wonder, since I am not completely familiar with how Nutch fetches URLs: if I inject urls/seed.txt on every run, Nutch will fetch the URLs from index.php, but will it also parse them in the same run? Or will Nutch parse previously fetched URLs? How does this work, and how can I influence which URLs will be fetched?

In my humble opinion, this class should be mentioned in the recrawl wiki.

Thanks in advance,
Jaap

On Tue, Jan 15, 2013 at 1:43 AM, <[email protected]> wrote:

> I think there is no need for a new plugin or something like that. If you
> know the list of news URLs, you need to inject them each cycle in order to
> fetch them and their new inlinks, since when you inject a URL its fetch
> time is set to the current time.
>
> Alex.
>
> -----Original Message-----
> From: J. Gobel <[email protected]>
> To: user <[email protected]>
> Sent: Mon, Jan 14, 2013 2:47 pm
> Subject: Re: nutch 2.x recrawl re-crawl
>
> I looked at the freegenerator tool, but I don't see how I can use that
> tool to crawl www.thelatestnews.com/index.php, scrape the newest links
> from that page, and fetch those.
>
> My dream plugin would allow me to feed Nutch priority URLs such as
> 'www.thelatestnews.com/index.php', and it would fetch all the new links
> found on that URL with priority.
>
> On Mon, Jan 14, 2013 at 11:42 PM, Markus Jelsma
> <[email protected]> wrote:
>
> > freegenerator tool
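For reference, the tab-delimited seed line described in the message above could be generated with a small script like the one below. This is only a sketch: the helper name `seed_line` is hypothetical and not part of Nutch, and `userType` is treated here as an arbitrary custom key=value pair, which is an assumption.

```python
def seed_line(url, score=None, fetch_interval=None, **custom):
    """Build one seed line: the URL followed by tab-separated key=value metadata.

    seed_line is a hypothetical helper for illustration, not a Nutch API.
    """
    fields = [url]
    if score is not None:
        fields.append(f"nutch.score={score}")
    if fetch_interval is not None:
        fields.append(f"nutch.fetchInterval={fetch_interval}")
    # Any extra keyword arguments are written as custom metadata (assumption).
    for key, value in custom.items():
        fields.append(f"{key}={value}")
    return "\t".join(fields)


# Values taken from the seed line quoted in the message.
line = seed_line(
    "http://www.latestnews.nl/",
    score=10,
    fetch_interval=60,
    userType="open_source",
)
print(line)
```

Writing the result to seeds.txt, one line per URL, keeps the separators as real tab characters, which is easy to get wrong when editing the file by hand.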

