Eric Osgood wrote:
Andrzej,
Based on what you suggested below, I have begun to write my own scoring
plugin:
Great!
In distributeScoreToOutlinks(), if the link contains the string I'm
looking for, I set its score to kept_score and add a flag to the
metaData in parseData (KEEP, true). How
Andrzej,
Based on what you suggested below, I have begun to write my own
scoring plugin:
In distributeScoreToOutlinks(), if the link contains the string I'm
looking for, I set its score to kept_score and add a flag to the
metaData in parseData (KEEP, true). How do I check for this flag
Also,
In the scoring-links plugin, I set the return value of
ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all URLs and
it still fetched everything. Maybe Float.MIN_VALUE isn't the correct
value to set so that a link never gets fetched?
Thanks,
Eric
On Oct 22, 2009, at 1:10 PM,
Eric Osgood wrote:
Andrzej,
How would I check for a flag during fetch?
You would check for a flag during generation - please check
ScoringFilter.generatorSortValue(), that's where you can check for a
flag and set the sort value to Float.MIN_VALUE - this way the link will
never be selected
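A note on the constant itself, since it comes up twice in this thread: in Java, Float.MIN_VALUE is the smallest positive nonzero float, not the most negative one. The sketch below is plain Java with no Nutch dependency; the class name and the lowestSortValue() helper are hypothetical and only illustrate the difference. If the generator uses the sort value only for ordering (an assumption about Nutch, not something confirmed in this thread), even the lowest value would only push a URL to the end of the list rather than exclude it.

```java
public class SortValueDemo {

    // Hypothetical helper: the most negative finite float, which sorts
    // below every other finite score (unlike Float.MIN_VALUE).
    static float lowestSortValue() {
        return -Float.MAX_VALUE;
    }

    public static void main(String[] args) {
        // Float.MIN_VALUE is a tiny positive number.
        System.out.println(Float.MIN_VALUE > 0.0f);                  // true

        // A URL with a score of 0 or any negative score therefore
        // still sorts below one that was assigned Float.MIN_VALUE.
        System.out.println(0.0f < Float.MIN_VALUE);                  // true

        // -Float.MAX_VALUE is below both.
        System.out.println(lowestSortValue() < 0.0f);                // true
        System.out.println(lowestSortValue() < Float.MIN_VALUE);     // true
    }
}
```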
Is there a way to inspect the list of links that Nutch finds per page
and then, at that point, choose which links I want to include or exclude?
That would be the ideal remedy for my problem.
Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per page
and then, at that point, choose which links I want to include or exclude?
That would be the ideal remedy for my problem.
Yes, look at ParseOutputFormat, you can make this decision there. There
are two standard
Andrzej,
How would I check for a flag during fetch?
Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but
still needing a total of X links per page, if I find the links I want,
I add them to the list up until X, if I don't reach X, I
Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a
plugin for this functionality but I don't know where to start.
Thanks,
EO
Eric wrote:
Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a plugin
for this functionality but I don't know where to start.
URLFilter plugins may be what you want.
--
Best regards,
Andrzej Bialecki
@lucene.apache.org
Subject: Re: Targeting Specific Links for Crawling
Eric wrote:
Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a plugin
for this functionality but I don't know where to start.
URLFilter plugins
can just set a regular expression to accept only those kinds of
links
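To illustrate the regular-expression idea above, here is a minimal sketch in plain Java. It does not implement Nutch's URLFilter interface, so that it stays self-contained; it only mirrors the contract as I understand it, where filter() returns the URL to keep it and null to drop it. The class name and the example.com pattern are hypothetical.

```java
import java.util.regex.Pattern;

public class TargetedLinkFilter {

    // Compiled once; the pattern describes the links we want to crawl.
    private final Pattern accept;

    public TargetedLinkFilter(String regex) {
        this.accept = Pattern.compile(regex);
    }

    // Returns the URL unchanged if it matches the accept pattern,
    // or null if the link should be dropped.
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        return accept.matcher(url).find() ? url : null;
    }

    public static void main(String[] args) {
        // Hypothetical pattern: keep only product pages on example.com.
        TargetedLinkFilter f =
            new TargetedLinkFilter("^https?://example\\.com/products/");

        System.out.println(f.filter("http://example.com/products/42")); // kept
        System.out.println(f.filter("http://example.com/about"));       // null
    }
}
```

In an actual crawl the same pattern would more likely live in the regex URL filter's configuration than in custom code; the class above only shows the accept-or-drop decision.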
Date: Mon, 5 Oct 2009 21:39:52 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Targeting Specific Links for Crawling
Eric wrote:
Does anyone know if it is possible to target only certain