Re: Targeting Specific Links

2009-10-23 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

Based on what you suggested below, I have begun to write my own scoring 
plugin:


Great!



in distributeScoreToOutlinks(), if the link contains the string I'm 
looking for, I set its score to kept_score and add a flag to the 
metaData in parseData (KEEP, true). How do I check for this flag in 
generatorSortValue()? I only see a way to check the score, not a flag.


The flag should have been automagically added to the target CrawlDatum 
metadata after you have updated your crawldb (see the details in 
CrawlDbReducer). Then in generatorSortValue() you can check for the 
presence of this flag using datum.getMetaData().
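For illustration, here is a minimal, self-contained sketch of that check. The class name, the "KEEP" key, and the plain java.util.Map are stand-ins invented for this example; in Nutch itself the metadata is a MapWritable keyed by org.apache.hadoop.io.Text, read via datum.getMetaData() inside ScoringFilter.generatorSortValue():

```java
import java.util.HashMap;
import java.util.Map;

public class KeepFlagSketch {
    static final String KEEP_KEY = "KEEP"; // assumed flag name from this thread

    // Stand-in for ScoringFilter.generatorSortValue(url, datum, initSort):
    // links without the KEEP flag get Float.MIN_VALUE so the Generator
    // can skip them (assuming the Generator is patched to honor it).
    static float generatorSortValue(Map<String, String> metaData, float initSort) {
        if (!"true".equals(metaData.get(KEEP_KEY))) {
            return Float.MIN_VALUE; // signal: never select for fetching
        }
        return initSort; // keep the normal sort score
    }

    public static void main(String[] args) {
        Map<String, String> flagged = new HashMap<>();
        flagged.put(KEEP_KEY, "true");
        System.out.println(generatorSortValue(flagged, 1.0f));         // 1.0
        System.out.println(generatorSortValue(new HashMap<>(), 1.0f)); // 1.4E-45
    }
}
```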


BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any 
special way ... I thought it did. It's easy to add, though - in 
Generator.java (around line 161) just add this:


if (sort == Float.MIN_VALUE) {
  return; // skip this record entirely - it will never be selected for fetching
}


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-22 Thread Eric Osgood

Andrzej,

Based on what you suggested below, I have begun to write my own  
scoring plugin:


in distributeScoreToOutlinks(), if the link contains the string I'm  
looking for, I set its score to kept_score and add a flag to the  
metaData in parseData (KEEP, true). How do I check for this flag  
in generatorSortValue()? I only see a way to check the score, not a  
flag.


Thanks,

Eric


On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:


Eric Osgood wrote:

Andrzej,
How would I check for a flag during fetch?


You would check for a flag during generation - please check  
ScoringFilter.generatorSortValue(), that's where you can check for a  
flag and set the sort value to Float.MIN_VALUE - this way the link  
will never be selected for fetching.


And you would put the flag in CrawlDatum metadata when  
ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().



Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but  
still keep a total of X links per page: if I find the links I  
want, I add them to the list up to X; if I don't reach X, I add  
other links until X is reached. This way, I don't waste crawl time  
on non-relevant links.


You can modify the collection of target links passed to  
distributeScoreToOutlinks() - this way you can affect both which  
links are stored and what kind of metadata each of them gets.
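As a rough, self-contained sketch of the selection policy described above (prefer matching links, then pad up to X), operating on plain URL strings instead of the collection of (url, CrawlDatum) targets that distributeScoreToOutlinks() actually receives - the class and method names here are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkSelector {
    // Keep up to x outlinks: wanted links first, then pad with the rest.
    static List<String> selectOutlinks(List<String> outlinks, String wanted, int x) {
        List<String> kept = new ArrayList<>();
        // first pass: take links containing the wanted string
        for (String url : outlinks) {
            if (kept.size() >= x) break;
            if (url.contains(wanted)) kept.add(url);
        }
        // second pass: pad with remaining links until x is reached
        for (String url : outlinks) {
            if (kept.size() >= x) break;
            if (!kept.contains(url)) kept.add(url);
        }
        return kept;
    }
}
```

In a real ScoringFilter the same two-pass idea would be applied to the targets collection itself, removing the entries you don't want stored.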


As I said, you can also use plain URLFilters to filter out  
unwanted links, but that API gives you much less control because  
it's a simple yes/no decision that considers just the URL string. The  
advantage is that it's much easier to implement than a ScoringFilter.



--
Best regards,
Andrzej Bialecki 
http://www.sigram.com  Contact: info at sigram dot com



Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Targeting Specific Links

2009-10-22 Thread Eric Osgood

Also,

In the scoring-links plugin, I set the return value of  
ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all URLs and  
it still fetched everything - maybe Float.MIN_VALUE isn't the correct  
value to set so a link never gets fetched?


Thanks,

Eric

On Oct 22, 2009, at 1:10 PM, Eric Osgood wrote:


Andrzej,

Based on what you suggested below, I have begun to write my own  
scoring plugin:


in distributeScoreToOutlinks(), if the link contains the string I'm  
looking for, I set its score to kept_score and add a flag to the  
metaData in parseData (KEEP, true). How do I check for this flag  
in generatorSortValue()? I only see a way to check the score, not a  
flag.


Thanks,

Eric


On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:


Eric Osgood wrote:

Andrzej,
How would I check for a flag during fetch?


You would check for a flag during generation - please check  
ScoringFilter.generatorSortValue(), that's where you can check for  
a flag and set the sort value to Float.MIN_VALUE - this way the  
link will never be selected for fetching.


And you would put the flag in CrawlDatum metadata when  
ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().



Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page,  
but still keep a total of X links per page: if I find the links  
I want, I add them to the list up to X; if I don't reach X, I  
add other links until X is reached. This way, I don't waste crawl  
time on non-relevant links.


You can modify the collection of target links passed to  
distributeScoreToOutlinks() - this way you can affect both which  
links are stored and what kind of metadata each of them gets.


As I said, you can also use plain URLFilters to filter out  
unwanted links, but that API gives you much less control because  
it's a simple yes/no decision that considers just the URL string. The  
advantage is that it's much easier to implement than a ScoringFilter.



--
Best regards,
Andrzej Bialecki 
http://www.sigram.com  Contact: info at sigram dot com






Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



Re: Targeting Specific Links

2009-10-07 Thread Andrzej Bialecki

Eric Osgood wrote:

Andrzej,

How would I check for a flag during fetch?


You would check for a flag during generation - please check 
ScoringFilter.generatorSortValue(), that's where you can check for a 
flag and set the sort value to Float.MIN_VALUE - this way the link will 
never be selected for fetching.


And you would put the flag in CrawlDatum metadata when ParseOutputFormat 
calls ScoringFilter.distributeScoreToOutlinks().




Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but 
still keep a total of X links per page: if I find the links I want, I 
add them to the list up to X; if I don't reach X, I add other links 
until X is reached. This way, I don't waste crawl time on non-relevant 
links.


You can modify the collection of target links passed to 
distributeScoreToOutlinks() - this way you can affect both which links 
are stored and what kind of metadata each of them gets.


As I said, you can also use plain URLFilters to filter out unwanted 
links, but that API gives you much less control because it's a simple 
yes/no decision that considers just the URL string. The advantage is that 
it's much easier to implement than a ScoringFilter.



--
Best regards,
Andrzej Bialecki 
http://www.sigram.com  Contact: info at sigram dot com



Targeting Specific Links

2009-10-06 Thread Eric Osgood
Is there a way to inspect the list of links that Nutch finds per page  
and then, at that point, choose which links I want to include / exclude?  
That would be the ideal remedy to my problem.


Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
-
eosg...@calpoly.edu
e...@lakemeadonline.com
-
www.calpoly.edu/eosgood
www.lakemeadonline.com



Re: Targeting Specific Links

2009-10-06 Thread Andrzej Bialecki

Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per page 
and then, at that point, choose which links I want to include / exclude? 
That would be the ideal remedy to my problem.


Yes, look at ParseOutputFormat; you can make this decision there. There 
are two standard extension points where you can hook up - URLFilters and 
ScoringFilters.


Please note that if you use URLFilters to filter out URLs too early, 
they will be rediscovered again and again. A better method to 
handle this, though more complicated, is to still include such links 
but give them a special flag (in metadata) that prevents fetching. This 
requires that you implement a custom scoring plugin.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Targeting Specific Links

2009-10-06 Thread Eric Osgood

Andrzej,

How would I check for a flag during fetch?

Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page, but  
still keep a total of X links per page: if I find the links I want,  
I add them to the list up to X; if I don't reach X, I add other  
links until X is reached. This way, I don't waste crawl time on  
non-relevant links.


Thanks,

Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/eosgood, www.lakemeadonline.com


On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:


Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per  
page and then, at that point, choose which links I want to include /  
exclude? That would be the ideal remedy to my problem.


Yes, look at ParseOutputFormat; you can make this decision there.  
There are two standard extension points where you can hook up -  
URLFilters and ScoringFilters.


Please note that if you use URLFilters to filter out URLs too early,  
they will be rediscovered again and again. A better method to  
handle this, though more complicated, is to still include such  
links but give them a special flag (in metadata) that prevents  
fetching. This requires that you implement a custom scoring plugin.



--
Best regards,
Andrzej Bialecki 
http://www.sigram.com  Contact: info at sigram dot com







Targeting Specific Links for Crawling

2009-10-05 Thread Eric
Does anyone know if it is possible to target only certain links for  
crawling dynamically during a crawl? My goal would be to write a  
plugin for this functionality, but I don't know where to start.


Thanks,

EO


Re: Targeting Specific Links for Crawling

2009-10-05 Thread Andrzej Bialecki

Eric wrote:
Does anyone know if it is possible to target only certain links for 
crawling dynamically during a crawl? My goal would be to write a plugin 
for this functionality, but I don't know where to start.


URLFilter plugins may be what you want.
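A minimal standalone sketch of that yes/no decision follows. This is not a drop-in plugin: in Nutch the class would implement org.apache.nutch.net.URLFilter (registered through the plugin descriptor), and the "/products/" substring and class name here are invented for illustration.

```java
public class TargetLinkFilter {
    // Assumed substring identifying the links we want - invented for this example.
    private static final String WANTED = "/products/";

    // Mirrors the URLFilter.filter(String) contract:
    // return the URL to accept it, or null to reject it.
    public String filter(String urlString) {
        return urlString != null && urlString.contains(WANTED) ? urlString : null;
    }
}
```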


--
Best regards,
Andrzej Bialecki 
http://www.sigram.com  Contact: info at sigram dot com



RE: Targeting Specific Links for Crawling

2009-10-05 Thread BELLINI ADAM



How do you want to target certain links? Do you know how the links are made, 
i.e. their format?
If so, you can just set a regular expression to accept only those kinds of links.
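For example, a hypothetical regex-urlfilter.txt might look like this (the /products/ pattern is invented for illustration; Nutch applies the first matching '+' or '-' rule to each URL, and the final '-.' rejects anything not accepted earlier):

```
# accept any URL containing /products/  (pattern invented for illustration)
+/products/
# reject everything else
-.
```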



 Date: Mon, 5 Oct 2009 21:39:52 +0200
 From: a...@getopt.org
 To: nutch-user@lucene.apache.org
 Subject: Re: Targeting Specific Links for Crawling
 
 Eric wrote:
  Does anyone know if it is possible to target only certain links for 
  crawling dynamically during a crawl? My goal would be to write a plugin 
  for this functionality, but I don't know where to start.
 
 URLFilter plugins may be what you want.
 
 
 -- 
 Best regards,
 Andrzej Bialecki 
 http://www.sigram.com  Contact: info at sigram dot com
 
  
_
New: Messenger sign-in on the MSN homepage
http://go.microsoft.com/?linkid=9677403

Re: Targeting Specific Links for Crawling

2009-10-05 Thread Eric

Adam,

Yes, I have a list of strings I would look for in the link. My plan is  
to look for X links per site - first looking for the links I want and  
adding them if they exist; if they don't exist, adding other links from  
the site until I reach X. I am planning to start in the URLFilter plugin.


Eric

On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:





How do you want to target certain links? Do you know how the links are  
made, i.e. their format?
If so, you can just set a regular expression to accept only those kinds of  
links.





Date: Mon, 5 Oct 2009 21:39:52 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Targeting Specific Links for Crawling

Eric wrote:

Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a  
plugin
for this functionality, but I don't know where to start.


URLFilter plugins may be what you want.


--
Best regards,
Andrzej Bialecki 
http://www.sigram.com  Contact: info at sigram dot com



_
New: Messenger sign-in on the MSN homepage
http://go.microsoft.com/?linkid=9677403




RE: Targeting Specific Links for Crawling

2009-10-05 Thread BELLINI ADAM

But when you start, you inject your starting points from your seed; after 
that Nutch fetches URLs and bypasses the ones filtered out by the URL filter 
(regular expressions). So to count the X URLs you need, you would have to 
crawl your whole site!
Of course, if you use no regular expression at all you will get all the 
links of your site (including the X links you need), but I guess you won't do that 
because it's a waste of time.
The only solution I can see is to set up urlfilter.txt carefully (with the right 
regular expressions).
Anybody have other ideas?







 Subject: Re: Targeting Specific Links for Crawling
 From: e...@lakemeadonline.com
 Date: Mon, 5 Oct 2009 13:07:25 -0700
 To: nutch-user@lucene.apache.org
 
 Adam,
 
  Yes, I have a list of strings I would look for in the link. My plan is  
  to look for X links per site - first looking for the links I want and  
  adding them if they exist; if they don't exist, adding other links from  
  the site until I reach X. I am planning to start in the URLFilter plugin.
 
 Eric
 
 On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:
 
 
 
 
  How do you want to target certain links? Do you know how the links are  
  made, i.e. their format?
  If so, you can just set a regular expression to accept only those kinds of  
  links
 
 
 
  Date: Mon, 5 Oct 2009 21:39:52 +0200
  From: a...@getopt.org
  To: nutch-user@lucene.apache.org
  Subject: Re: Targeting Specific Links for Crawling
 
  Eric wrote:
  Does anyone know if it is possible to target only certain links for
  crawling dynamically during a crawl? My goal would be to write a  
  plugin
  for this functionality, but I don't know where to start.
 
  URLFilter plugins may be what you want.
 
 
  -- 
  Best regards,
  Andrzej Bialecki 
  http://www.sigram.com  Contact: info at sigram dot com
 
  
  _
  New: Messenger sign-in on the MSN homepage
  http://go.microsoft.com/?linkid=9677403
 
  
_
New! Open Messenger faster on the MSN homepage
http://go.microsoft.com/?linkid=9677405