Re: Targeting Specific Links
Eric Osgood wrote:
> Andrzej, Based on what you suggested below, I have begun to write my own scoring plugin:

Great!

> in distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag (KEEP, true) to the metaData in parseData. How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

The flag should have been automagically added to the target CrawlDatum metadata after you have updated your crawldb (see the details in CrawlDbReducer). Then, in generatorSortValue(), you can check for the presence of this flag by using datum.getMetaData().

BTW - you are right, the Generator doesn't treat Float.MIN_VALUE in any special way ... I thought it did. It's easy to add this, though - in Generator.java:161 just add this:

    if (sort == Float.MIN_VALUE) { return; }

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
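The flag check Andrzej describes could be sketched roughly as follows. This is a simplified stand-in, not real Nutch code: the `KEEP` key, the plain `Map` metadata, and the `shouldGenerate()` helper are all assumptions made for illustration; a real plugin would work with `CrawlDatum` and the actual `ScoringFilter` signatures.

```java
import java.util.HashMap;
import java.util.Map;

public class KeepFlagSketch {
    // Hypothetical metadata key; a real plugin would use a Writable key.
    static final String KEEP_KEY = "KEEP";

    // Simplified stand-in for ScoringFilter.generatorSortValue(): links
    // without the KEEP flag get the Float.MIN_VALUE sentinel.
    static float generatorSortValue(Map<String, String> datumMeta, float initSort) {
        return datumMeta.containsKey(KEEP_KEY) ? initSort : Float.MIN_VALUE;
    }

    // Simplified version of the Generator patch: skip sentinel entries,
    // the equivalent of the early "return;" added in Generator.java.
    static boolean shouldGenerate(float sort) {
        return sort != Float.MIN_VALUE;
    }

    public static void main(String[] args) {
        Map<String, String> kept = new HashMap<>();
        kept.put(KEEP_KEY, "true");
        Map<String, String> unflagged = new HashMap<>();

        System.out.println(shouldGenerate(generatorSortValue(kept, 1.0f)));      // true
        System.out.println(shouldGenerate(generatorSortValue(unflagged, 1.0f))); // false
    }
}
```

Note that `Float.MIN_VALUE` works here only as a sentinel the patched Generator compares against; it is the smallest positive float, not the most negative one, so without the patch it does not sort links out of the fetch list by itself.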
Re: Targeting Specific Links
Andrzej,

Based on what you suggested below, I have begun to write my own scoring plugin: in distributeScoreToOutlinks(), if the link contains the string I'm looking for, I set its score to kept_score and add a flag (KEEP, true) to the metaData in parseData. How do I check for this flag in generatorSortValue()? I only see a way to check the score, not a flag.

Thanks,
Eric

On Oct 7, 2009, at 2:48 AM, Andrzej Bialecki wrote:
> [...]

Eric Osgood
Cal Poly - Computer Engineering
Moon Valley Software
eosg...@calpoly.edu, e...@lakemeadonline.com
www.calpoly.edu/~eosgood, www.lakemeadonline.com
Re: Targeting Specific Links
Also, in the scoring-links plugin I set the return value of ScoringFilter.generatorSortValue() to Float.MIN_VALUE for all URLs and it still fetched everything - maybe Float.MIN_VALUE isn't the correct value to set so that a link never gets fetched?

Thanks,
Eric

On Oct 22, 2009, at 1:10 PM, Eric Osgood wrote:
> [...]
Re: Targeting Specific Links
Eric Osgood wrote:
> Andrzej, How would I check for a flag during fetch?

You would check for a flag during generation - please check ScoringFilter.generatorSortValue(); that's where you can check for a flag and set the sort value to Float.MIN_VALUE - this way the link will never be selected for fetching. And you would put the flag in CrawlDatum metadata when ParseOutputFormat calls ScoringFilter.distributeScoreToOutlinks().

> Maybe this explanation can shed some light: Ideally, I would like to check the list of links for each page, but still needing a total of X links per page. If I find the links I want, I add them to the list up until X; if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

You can modify the collection of target links passed to distributeScoreToOutlinks() - this way you can affect both which links are stored and what kind of metadata each of them gets. As I said, you can also use plain URLFilters to filter out unwanted links, but that API gives you much less control, because it's a simple yes/no decision that considers just the URL string. The advantage is that it's much easier to implement than a ScoringFilter.

--
Best regards,
Andrzej Bialecki
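The two halves Andrzej describes - flagging wanted links in distributeScoreToOutlinks() and reading the flag back in generatorSortValue() - could be sketched like this. The `Target` class, the `KEEP` key, and the simplified signatures are illustrative assumptions, not the real Nutch API, which deals in `CrawlDatum` objects and Hadoop `Writable` metadata.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutlinkScoringSketch {
    static final String KEEP_KEY = "KEEP";
    static final float KEPT_SCORE = 1.0f;

    // Stand-in for an outlink target: URL plus metadata plus score.
    static class Target {
        final String url;
        final Map<String, String> meta = new HashMap<>();
        float score;
        Target(String url) { this.url = url; }
    }

    // Simplified distributeScoreToOutlinks(): flag the links we care about.
    static void distributeScoreToOutlinks(List<Target> targets, String wanted) {
        for (Target t : targets) {
            if (t.url.contains(wanted)) {
                t.score = KEPT_SCORE;
                t.meta.put(KEEP_KEY, "true");
            }
        }
    }

    // Simplified generatorSortValue(): unflagged links sort to the sentinel.
    static float generatorSortValue(Target t) {
        return t.meta.containsKey(KEEP_KEY) ? t.score : Float.MIN_VALUE;
    }

    public static void main(String[] args) {
        List<Target> targets = new ArrayList<>();
        targets.add(new Target("http://example.com/product/1"));
        targets.add(new Target("http://example.com/about"));
        distributeScoreToOutlinks(targets, "/product/");
        for (Target t : targets) {
            System.out.println(t.url + " -> " + generatorSortValue(t));
        }
    }
}
```

Because the targets collection itself is mutable, the same method could also drop unwanted entries outright instead of flagging them, which is the "modify the collection" option mentioned above.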
Targeting Specific Links
Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include / exclude? That is the ideal remedy to my problem.

Eric Osgood
Re: Targeting Specific Links
Eric Osgood wrote:
> Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include / exclude? That is the ideal remedy to my problem.

Yes - look at ParseOutputFormat; you can make this decision there. There are two standard extension points where you can hook up: URLFilters and ScoringFilters. Please note that if you use URLFilters to filter out URLs too early, then they will be rediscovered again and again. A better method to handle this, but also a more complicated one, is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin.

--
Best regards,
Andrzej Bialecki
Re: Targeting Specific Links
Andrzej,

How would I check for a flag during fetch? Maybe this explanation can shed some light: Ideally, I would like to check the list of links for each page, but still needing a total of X links per page. If I find the links I want, I add them to the list up until X; if I don't reach X, I add other links until X is reached. This way, I don't waste crawl time on non-relevant links.

Thanks,
Eric

On Oct 6, 2009, at 1:04 PM, Andrzej Bialecki wrote:
> [...]
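The "X links per page" selection described above could be prototyped in plain Java like this. It is only a sketch of the selection logic; the `selectLinks` helper and its string-matching criterion are hypothetical and would have to be adapted to the actual outlink collection passed to distributeScoreToOutlinks().

```java
import java.util.ArrayList;
import java.util.List;

public class LinkSelectionSketch {
    // Pick up to max outlinks: wanted links first, then pad with the rest.
    static List<String> selectLinks(List<String> outlinks, String wanted, int max) {
        List<String> selected = new ArrayList<>();
        for (String url : outlinks) {            // first pass: wanted links
            if (selected.size() >= max) break;
            if (url.contains(wanted)) selected.add(url);
        }
        for (String url : outlinks) {            // second pass: pad with others
            if (selected.size() >= max) break;
            if (!selected.contains(url)) selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
            "http://a.com/misc", "http://a.com/keep/1",
            "http://a.com/keep/2", "http://a.com/other");
        System.out.println(selectLinks(links, "/keep/", 3));
        // [http://a.com/keep/1, http://a.com/keep/2, http://a.com/misc]
    }
}
```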
Targeting Specific Links for Crawling
Does anyone know if it is possible to target only certain links for crawling, dynamically, during a crawl? My goal would be to write a plugin for this functionality, but I don't know where to start.

Thanks,
EO
Re: Targeting Specific Links for Crawling
Eric wrote:
> Does anyone know if it is possible to target only certain links for crawling, dynamically, during a crawl? My goal would be to write a plugin for this functionality, but I don't know where to start.

URLFilter plugins may be what you want.

--
Best regards,
Andrzej Bialecki
RE: Targeting Specific Links for Crawling
How do you target certain links? Do you know how the links are made - I mean, their format? You could just set a regular expression to accept only those kinds of links.

Date: Mon, 5 Oct 2009 21:39:52 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Targeting Specific Links for Crawling
> [...]
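The regular-expression approach Adam suggests can be prototyped outside Nutch with plain java.util.regex. This sketch mimics the first-match-wins, accept/reject semantics of Nutch's regex URL filter; the rules table and the example patterns are assumptions for illustration, not an actual filter configuration.

```java
import java.util.regex.Pattern;

public class RegexFilterSketch {
    // First matching rule wins: "+" accepts, "-" rejects; no match rejects.
    static final String[][] RULES = {
        {"+", "^https?://www\\.example\\.com/products/.*"},
        {"-", ".*\\.(gif|jpg|png|css|js)$"},
    };

    static boolean accept(String url) {
        for (String[] rule : RULES) {
            if (Pattern.matches(rule[1], url)) {
                return rule[0].equals("+");
            }
        }
        return false;  // default: filter out
    }

    public static void main(String[] args) {
        System.out.println(accept("http://www.example.com/products/42"));  // true
        System.out.println(accept("http://www.example.com/logo.png"));     // false
    }
}
```

As noted elsewhere in the thread, this yes/no decision sees only the URL string, which is exactly the limitation that pushes the more ambitious per-page selection toward a ScoringFilter.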
Re: Targeting Specific Links for Crawling
Adam,

Yes, I have a list of strings I would look for in the link. My plan is to look for X number of links on the site - first looking for the links I want and, if they exist, adding them; if they don't exist, adding X links from the site. I am planning to start in the URLFilter plugin.

Eric

On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:
> [...]
RE: Targeting Specific Links for Crawling
But when you start by injecting your starting point from your seed, Nutch will then fetch URLs and bypass those filtered out by the urlfilter (regular expression). So to calculate the number X of those URLs you would have to crawl your whole site. For sure, if you don't have any regular expression you will get all the links of your site (including the X needed links), but I guess you won't do that, because it's a waste of time. I can see just one solution: set up urlfilter.txt properly (with the right regular expression).

Anybody have other ideas?

Subject: Re: Targeting Specific Links for Crawling
From: e...@lakemeadonline.com
Date: Mon, 5 Oct 2009 13:07:25 -0700
To: nutch-user@lucene.apache.org
> [...]
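For reference, a filter file along the lines Adam describes might look like the fragment below. The host and path patterns are made-up examples; rules are Java regexes tried top to bottom, with "+" accepting and "-" rejecting the URL.

```
# skip image, stylesheet, and script suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$
# accept only the targeted links (example pattern)
+^http://www\.example\.com/products/
# reject everything else
-.
```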