RE: Unable to tell if whether is any changes for the same webpage
Is this function provided in the Nutch package? Can I use it directly without programming against the API?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, May 02, 2008 8:12 PM
To: nutch-user@lucene.apache.org
Subject: Re: Unable to tell if whether is any changes for the same webpage

It uses MD5 of the page content and another method whose exact name I cannot remember now, but that is more forgiving of small textual changes. I think it also takes into consideration the Last-Modified HTTP response header, but I'd have to check that.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Miao Liqiang NCS [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Friday, May 2, 2008 8:29:38 AM
Subject: RE: Unable to tell if whether is any changes for the same webpage

Could you tell me how?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, May 02, 2008 2:12 PM
To: nutch-user@lucene.apache.org
Subject: Re: Unable to tell if whether is any changes for the same webpage

Hi,

Yes, Nutch can detect when a page changed and when it didn't change.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Miao Liqiang NCS
To: nutch-user@lucene.apache.org
Sent: Friday, May 2, 2008 7:48:03 AM
Subject: Unable to tell if whether is any changes for the same webpage

Is Nutch able to tell whether there are any changes to the same webpage? For example, if a webpage has been updated since the last crawl, can Nutch detect that change when it recrawls the page?
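To make the signature idea in Otis's reply concrete, here is a minimal sketch in Python (Nutch itself is written in Java). It is not Nutch's code. The function names are hypothetical, and the "tolerant" variant is only a stand-in for the more forgiving method mentioned in the thread, whose actual algorithm is not described there. The Last-Modified header is left out entirely.

```python
import hashlib
import re


def md5_signature(content: bytes) -> str:
    """Exact signature: any byte-level change produces a different digest."""
    return hashlib.md5(content).hexdigest()


def tolerant_signature(text: str) -> str:
    """Hypothetical 'forgiving' signature: ignores case and whitespace layout."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def has_changed(old_signature: str, new_signature: str) -> bool:
    """A page counts as changed when its signature differs from the stored one."""
    return old_signature != new_signature


if __name__ == "__main__":
    first_fetch = "Welcome to  Nutch.\nLatest news: none."
    refetch_same_text = "Welcome to Nutch. Latest news: none."
    refetch_new_text = "Welcome to Nutch. Latest news: updated."

    stored_exact = md5_signature(first_fetch.encode("utf-8"))
    stored_tolerant = tolerant_signature(first_fetch)

    # Only whitespace differs: the exact digest changes, the tolerant one does not.
    print(has_changed(stored_exact, md5_signature(refetch_same_text.encode("utf-8"))))
    print(has_changed(stored_tolerant, tolerant_signature(refetch_same_text)))
    # The wording differs: the tolerant signature reports a change as well.
    print(has_changed(stored_tolerant, tolerant_signature(refetch_new_text)))
```

The point of the two variants is the trade-off Otis hints at: an exact digest flags every cosmetic edit as a change, while a normalized one only reacts when the text itself differs.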
What kind of searches does Nutch support?
Re: Unable to tell if whether is any changes for the same webpage
It's part of Nutch, happens automatically.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Miao Liqiang NCS [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Sunday, May 4, 2008 8:33:49 PM
Subject: RE: Unable to tell if whether is any changes for the same webpage

Is this function provided in the nutch package, can I use it directly without programming the API?
RE: Unable to tell if whether is any changes for the same webpage
How does Nutch report that there are changes? Am I able to find out whether there are changes or not? If Nutch knows internally that a page has changed, can I get that information from outside, through an API or something?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, May 05, 2008 11:32 AM
To: nutch-user@lucene.apache.org
Subject: Re: Unable to tell if whether is any changes for the same webpage

It's part of Nutch, happens automatically.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Nutch API and Lucene API are same?
Hi Otis,

Thanks for the quick response!

[EMAIL PROTECTED] wrote:

Hi Vineet,

No, the Nutch API and the Lucene API are different. Nutch does use Lucene for indexing/searching, so you *can* use Lucene and its API for searching an index you built with Nutch. Just make sure you use the same version of Lucene that Nutch is using.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Vineet Garg [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Friday, May 2, 2008 1:17:31 PM
Subject: Nutch API and Lucene API are same?

Hi,

I have crawled and indexed my local file system using Nutch 0.9. I want to build a web-based search application that searches the indexed local file system, and I want to perform the search using the Lucene API. Are the Lucene API and the Nutch API the same? If not, can I use the Lucene API to search files indexed by Nutch 0.9?

Please reply.

Regards,
Vineet Garg
Someone Please respond ... Deleting Urls already crawled from the crawlDB
Guys, I have been trying to get this done for weeks now, with no progress. Someone please help me.

I am trying to delete a domain that has already been crawled from my crawldb and index. I have a list of domains already crawled in my index. How do I exclude or delete domains from my crawl output folder?

I have tried using crawl-urlfilter.txt with these rules:

+^http://([a-z0-9]*\.)*
-^http://([a-z0-9]*?\.)*remita.net

I was hoping this would exclude the domain remita.net from the crawldb and index and include all the other URLs. Then I ran LinkDbMerger, SegmentMerger, CrawlDbMerger and IndexMerger. No change: all domains remain part of my output.

Please, how can I get this done?
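One thing worth checking in the rules above is their order. The following is a minimal sketch in Python (not Nutch code; `parse_rules` and `accepts` are hypothetical names), assuming the filter tries the rules top to bottom and lets the first matching pattern decide, which is how the comments in the stock crawl-urlfilter.txt describe it. Whether the merge tools consult the URL filter at all is a separate question that this sketch does not address.

```python
import re
from typing import List, Tuple


def parse_rules(lines: List[str]) -> List[Tuple[str, re.Pattern]]:
    """Turn '+regex' / '-regex' lines into (sign, compiled pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign, re.compile(pattern)))
    return rules


def accepts(rules: List[Tuple[str, re.Pattern]], url: str) -> bool:
    """Assumed semantics: the first matching rule decides; no match means reject."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    posted_order = parse_rules([
        r"+^http://([a-z0-9]*\.)*",
        r"-^http://([a-z0-9]*?\.)*remita.net",
    ])
    exclude_first = parse_rules([
        r"-^http://([a-z0-9]*?\.)*remita.net",
        r"+^http://([a-z0-9]*\.)*",
    ])
    for url in ("http://www.remita.net/", "http://www.example.com/"):
        print(url, accepts(posted_order, url), accepts(exclude_first, url))
```

Under that assumption, the `+` rule in the posted order matches every http:// URL before the `-` rule is ever tried, so remita.net is still accepted. Putting the exclusion first rejects it while leaving other hosts accepted.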