Re: Targeting Specific Links for Crawling
Eric wrote:
Does anyone know if it is possible to target only certain links for crawling dynamically during a crawl? My goal would be to write a plugin for this functionality but I don't know where to start.

URLFilter plugins may be what you want.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Targeting Specific Links for Crawling
How do you want to target certain links? Do you know how the links are formed, i.e. their format? If so, you can just set a regular expression to accept only those kinds of links.

Date: Mon, 5 Oct 2009 21:39:52 +0200
From: a...@getopt.org
To: nutch-user@lucene.apache.org
Subject: Re: Targeting Specific Links for Crawling

URLFilter plugins may be what you want.
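The regular-expression suggestion above can be sketched as follows. This is only an illustration of the idea, written in Python for brevity; a real Nutch plugin would be written in Java. The pattern, the host www.example.com and the function name `url_filter` are assumptions made for the example, not anything from the thread. The sketch follows the usual URL-filter contract of returning the URL when it is accepted and nothing when it is rejected.

```python
import re
from typing import Optional

# Hypothetical pattern: accept only links under /products/ or /news/
# on an example host. Replace it with the format of the links you want.
ACCEPT = re.compile(r"^https?://www\.example\.com/(products|news)/")


def url_filter(url: str) -> Optional[str]:
    """Return the URL unchanged if it is accepted, or None to reject it."""
    return url if ACCEPT.search(url) else None
```

A link such as http://www.example.com/products/42 passes through unchanged, while http://www.example.com/about.html is rejected.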
Re: Targeting Specific Links for Crawling
Adam,

Yes, I have a list of strings I would look for in the link. My plan is to collect X links from the site: first look for the links I want and, if they exist, add them; if they don't exist, add X links from the site. I am planning to start in the URL Filter plugin.

Eric

On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:
How do you want to target certain links? Do you know how the links are formed, i.e. their format? If so, you can just set a regular expression to accept only those kinds of links.
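One possible reading of the plan above, as a small Python sketch. It assumes the candidate links of a site are already available as a list, which is not how a single URL filter call sees them; the function name `select_links` and its parameters are hypothetical and only serve the illustration.

```python
from typing import List


def select_links(links: List[str], wanted: List[str], x: int) -> List[str]:
    """Pick up to x links, preferring those that contain a wanted string.

    If at least one link contains one of the wanted strings, only such
    links are returned (at most x). Otherwise fall back to the first x
    links found on the site.
    """
    preferred = [link for link in links if any(s in link for s in wanted)]
    if preferred:
        return preferred[:x]
    return links[:x]
```

The selection needs the whole list of links before it can decide, so it cannot be made one URL at a time.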
RE: Targeting Specific Links for Crawling
But when you start by injecting your starting points from your seed list, Nutch will then fetch URLs and skip those filtered out by the URL filter (regular expression). So to count the X URLs you want, you would have to crawl your whole site! For sure, if you have no regular expression you will get all the links of your site (including the X links you need), but I guess you won't do that because it's a waste of time. The only solution I can see is to set up urlfilter.txt properly (with the right regular expression). Does anybody have other ideas?

Subject: Re: Targeting Specific Links for Crawling
From: e...@lakemeadonline.com
Date: Mon, 5 Oct 2009 13:07:25 -0700
To: nutch-user@lucene.apache.org

Adam,

Yes, I have a list of strings I would look for in the link. My plan is to collect X links from the site: first look for the links I want and, if they exist, add them; if they don't exist, add X links from the site. I am planning to start in the URL Filter plugin.

Eric
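To make the point about the filter concrete, here is a rough Python sketch of how a list of accept/reject regular expressions can be evaluated, one URL at a time. It assumes rules prefixed with "+" (accept) or "-" (reject) where the first matching rule decides and an unmatched URL is rejected. The rule list, the host and the name `accepts` are made up for the example.

```python
import re
from typing import List, Tuple

# Hypothetical rules, checked from top to bottom.
RULES: List[Tuple[str, str]] = [
    ("-", r"\.(gif|jpg|png|css|js)$"),            # skip images and assets
    ("+", r"^https?://www\.example\.com/docs/"),  # keep the links we want
    ("-", r"."),                                  # reject everything else
]


def accepts(url: str, rules: List[Tuple[str, str]] = RULES) -> bool:
    """Return True if the first rule matching the URL is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected
```

Each call sees a single URL and nothing else, which is why a filter of this kind cannot know how many wanted links a site contains before the site has been crawled.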