RE: Unable to tell if whether is any changes for the same webpage

2008-05-04 Thread Miao Liqiang NCS
Is this function provided in the Nutch package? Can I use it directly,
without programming against the API?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 02, 2008 8:12 PM
To: nutch-user@lucene.apache.org
Subject: Re: Unable to tell if whether is any changes for the same
webpage

It uses MD5 of the page content and another method whose exact name I
cannot remember now, but that is more forgiving of small textual
changes.  I think it also takes into consideration the Last-Modified
HTTP response header, but I'd have to check that.
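
The MD5 part of that description can be illustrated with a minimal sketch. This is not Nutch's own code: the function names `page_signature` and `has_changed` are hypothetical, and it assumes the crawler stores one digest per URL from the previous fetch. An exact hash like this flags any byte-level difference, which is why a second, more forgiving method is mentioned above.

```python
import hashlib


def page_signature(content: bytes) -> str:
    """Return the MD5 hex digest of the raw page content (hypothetical helper)."""
    return hashlib.md5(content).hexdigest()


def has_changed(previous_signature: str, new_content: bytes) -> bool:
    """Compare the stored digest with the digest of the freshly fetched content."""
    return page_signature(new_content) != previous_signature


# Illustrative data only: two fetches of the same URL.
first_fetch = b"<html><body>Hello</body></html>"
second_fetch = b"<html><body>Hello, world</body></html>"

stored = page_signature(first_fetch)
print(has_changed(stored, first_fetch))   # unchanged page
print(has_changed(stored, second_fetch))  # updated page
```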
 
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
 From: Miao Liqiang NCS [EMAIL PROTECTED]
 To: nutch-user@lucene.apache.org
 Sent: Friday, May 2, 2008 8:29:38 AM
 Subject: RE: Unable to tell if whether is any changes for the same
webpage
 
 Could you tell me how?
 
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: Friday, May 02, 2008 2:12 PM
 To: nutch-user@lucene.apache.org
 Subject: Re: Unable to tell if whether is any changes for the same
 webpage
 
 Hi,
 
 Yes, Nutch can detect when a page changed and when it didn't change.
  
 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
  From: Miao Liqiang NCS 
  To: nutch-user@lucene.apache.org
  Sent: Friday, May 2, 2008 7:48:03 AM
  Subject: Unable to tell if whether is any changes for the same
webpage
  
  Is Nutch able to tell whether there are any changes for the same
  webpage? For example, if a webpage has been updated since the last
  crawl, can Nutch detect that change when it recrawls the page?
  
  
 


What kind of searches does Nutch support?

2008-05-04 Thread Miao Liqiang NCS
What kind of searches does Nutch support?



Re: Unable to tell if whether is any changes for the same webpage

2008-05-04 Thread ogjunk-nutch
It's part of Nutch, happens automatically.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





RE: Unable to tell if whether is any changes for the same webpage

2008-05-04 Thread Miao Liqiang NCS
In what way does Nutch report that there are changes? Am I able to find
out whether there are changes or not? If Nutch knows internally that there
are changes, can I learn that from outside, through an API or something?

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 05, 2008 11:32 AM
To: nutch-user@lucene.apache.org
Subject: Re: Unable to tell if whether is any changes for the same
webpage

It's part of Nutch, happens automatically.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Nutch API and Lucene API are same?

2008-05-04 Thread Vineet Garg

Hi Otis,

Thanks for the quick response!

[EMAIL PROTECTED] wrote:

Hi Vineet,

No, Nutch API and Lucene API are different.  Nutch does use Lucene for 
indexing/searching, so you *can* use Lucene and its API for searching an index 
you built with Nutch.  Just make sure you use the same version of Lucene that 
Nutch is using.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
  

From: Vineet Garg [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Friday, May 2, 2008 1:17:31 PM
Subject: Nutch API and Lucene API are same?

Hi,

I have crawled and indexed my local file system using Nutch 0.9. I want 
to build a web-based search application that searches the indexed 
local file system, and I want to perform the search using the Lucene API. 
Are the Lucene API and the Nutch API the same? If not, can I use the Lucene 
API to search files indexed by Nutch 0.9?

Please reply.

Regards,
Vineet Garg






  




Someone Please respond ... Deleting Urls already crawled from the crawlDB

2008-05-04 Thread oddaniel

Guys, I have been trying to get this done for weeks now, with no progress.
Someone please help me. I am trying to delete a domain that has already been
crawled from my crawldb and index. 

I have a list of domains already crawled in my index. How do I exclude or
delete domains from my crawl output folder? I have tried using
crawl-urlfilter.txt:

+^http://([a-z0-9]*\.)*
-^http://([a-z0-9]*?\.)*remita.net

I was hoping this would exclude the domain remita.net from the crawldb and
index and include all the other URLs. Then I ran LinkDbMerger, SegmentMerger,
CrawlDbMerger, and IndexMerger. No change. All domains remain part of my output.

How can I get this done?
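
One thing worth checking is the order of the two rules. The sketch below is not Nutch code; it assumes, as the comments in Nutch's stock regex URL filter files describe, that rules are tried from top to bottom, that the first pattern found in the URL decides, and that a URL matching no rule is rejected. The helper names `load_rules` and `accepts` are hypothetical. Under those assumptions, the `+` rule listed first matches every http URL, so the `-` rule for remita.net is never reached.

```python
import re


def load_rules(lines):
    """Parse '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First-match-wins evaluation (assumed behaviour, see the lead-in)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: no matching rule means the URL is rejected


# The two rules exactly as posted, in the posted order.
posted = load_rules([
    r"+^http://([a-z0-9]*\.)*",
    r"-^http://([a-z0-9]*?\.)*remita.net",
])

# The same two rules with the exclusion moved first.
reordered = load_rules([
    r"-^http://([a-z0-9]*?\.)*remita.net",
    r"+^http://([a-z0-9]*\.)*",
])

print(accepts(posted, "http://www.remita.net/page"))     # catch-all wins
print(accepts(reordered, "http://www.remita.net/page"))  # exclusion wins
print(accepts(reordered, "http://www.example.com/"))     # still accepted
```

Whether each merge tool actually applies the URL filters is a separate question that this sketch does not address.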
-- 
View this message in context: 
http://www.nabble.com/Someone-Please-respond-...-Deleting-Urls-already-crawled-from-the-crawlDB-tp17053927p17053927.html
Sent from the Nutch - User mailing list archive at Nabble.com.