Hi Jaap,

On Tue, Jan 15, 2013 at 1:52 AM, J. Gobel <[email protected]> wrote:

> http://www.latestnews.nl/    nutch.score=10    nutch.fetchInterval=60
> userType=open_source
>
> (what is userType?)
>

AFAIK this can be any metadata field, to which you then assign a value. In
the example here we use userType to refer to the nature of Nutch as an open
source project. I've updated the wiki with this, thanks for pointing it
out.


>
> Now I wonder, I am not completely familiar how Nutch is fetching the urls.
> If I inject the urls/seed.txt on every run, Nutch will fetch the urls from
> index.php . but will it also parse them on the same run?


Nutch will fetch and parse the first round of URLs in your seed.txt and
will extract their outlinks. The -topN parameter then limits how many of
the top-scoring URLs are selected for the next round of fetching and
subsequent parsing.


> Or will Nutch
> parse previous fetched Urls?


If URLs are due for a fetch, then Nutch selects them when you generate and
attempts to fetch them. Once they have been fetched, they should be parsed.
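
To make the selection step a bit more concrete, here is a toy model in
Python. It is only a sketch of the idea (due URLs first, best score on top,
capped at topN) and not Nutch's actual code; the Entry class, its fields and
select_for_fetch are names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one crawl database record."""
    url: str
    score: float
    next_fetch: int  # epoch seconds at which the URL becomes due


def select_for_fetch(entries, now, top_n):
    """Return at most top_n due URLs, highest score first (toy model)."""
    # A URL is considered due once its scheduled fetch time has passed.
    due = [e for e in entries if e.next_fetch <= now]
    # Higher-scoring URLs are preferred when the list is capped.
    due.sort(key=lambda e: e.score, reverse=True)
    return [e.url for e in due[:top_n]]
```

In this model a URL that is not yet due is never selected, however high its
score, and a due URL with a low score can still be left out when topN is
small.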

> How does this work, and how can I influence
> what urls will be fetched?
>

Well, you can start by assigning a custom score and fetch interval to the
injected URLs.
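
As a rough illustration, the small Python helper below builds a seed line
like the one you quoted. As far as I know the fields after the URL are
separated by tabs and nutch.fetchInterval is given in seconds, but please
check that against your Nutch version. seed_line is just a hypothetical
helper name, not part of Nutch.

```python
def seed_line(url, score=None, fetch_interval=None, **metadata):
    """Build one seed file line: URL followed by tab-separated key=value pairs."""
    fields = [url]
    if score is not None:
        fields.append(f"nutch.score={score}")
    if fetch_interval is not None:
        # Assumed to be interpreted as seconds by the injector.
        fields.append(f"nutch.fetchInterval={fetch_interval}")
    # Any further keyword arguments become custom metadata such as userType.
    fields.extend(f"{key}={value}" for key, value in metadata.items())
    return "\t".join(fields)


line = seed_line("http://www.latestnews.nl/", score=10,
                 fetch_interval=60, userType="open_source")
```

Writing one such line per URL into urls/seed.txt before injecting gives each
URL its own starting score and fetch interval.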


>
> In my humble opinion, this class should be mentioned in the recrawl wiki.
>

If you would like to contribute to the wiki documentation then please sign
up to the wiki and I can grant you the necessary karma.


Lewis
