Modified date in crawldb

2007-01-25 Thread Armel T. Nene
Hi guys,

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
save the last-modified date of files. I have run a crawl on my local file
system and on the web. When I dumped the content of the crawldb for both
crawls, the modified date of every entry was set to 01-Jan-1970 01:00:00. I
don't know if this is intended or if it's a bug. Therefore my question is:

* How does the generator know which files to crawl again?

  o Is it looking at the fetch time?

  o Or at the modified date, even though this can be misleading?

Most HTTP responses return a modified date in their headers, and every file on
a file system has a last-modified date. How come it's not stored in the
crawldb?
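
For reference, both of the dates mentioned above can be read with standard Java calls. The following is only a minimal sketch and not Nutch code; the class and method names are made up for illustration. java.io.File.lastModified() and java.net.URLConnection.getLastModified() each return milliseconds since the epoch, and each returns 0 when the value is not known.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, not part of Nutch.
class LastModifiedSources {

    /** Last-modified time of a local file in epoch millis, or 0 if the file does not exist. */
    static long ofFile(File file) {
        return file.lastModified();
    }

    /** Value of the Last-Modified response header in epoch millis, or 0 if the server sent none. */
    static long ofUrl(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        return connection.getLastModified();
    }

    public static void main(String[] args) {
        // Prints the last-modified time of the current working directory.
        System.out.println(ofFile(new File(".")));
    }
}
```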

 

Here is an extract from my two crawls:

http://dmoz.org/Arts/   Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:45:43 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.013471641
Signature: fe52a0bcb1071070689d0f661c168648
Metadata: null

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml   Version: 4
Status: 2 (DB_fetched)
Fetch time: Sat Feb 24 10:31:44 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.1035091E-4
Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
Metadata: null
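
One possible reading of this dump, offered as a hedged sketch rather than a statement of how Nutch works: a modified time of 0 milliseconds is the Unix epoch, which prints as 1 January 1970 (01:00 in a time zone one hour ahead of UTC), so the field looks unset rather than wrong. The fetch times also look like next-due times, since each one minus the 30-day retry interval falls on 23 or 25 January 2007. The isDue check is an assumption about what a generator could do with that field, and the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for reading the dump; not Nutch code.
class CrawlDbTimes {

    /** Renders epoch milliseconds as an ISO-8601 UTC timestamp. */
    static String formatUtc(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    /** Works back from a stored fetch time to the start of its retry interval. */
    static Instant intervalStart(Instant fetchTime, Duration retryInterval) {
        return fetchTime.minus(retryInterval);
    }

    /** Assumption: an entry is due for re-fetching once its fetch time has passed. */
    static boolean isDue(Instant fetchTime, Instant now) {
        return !fetchTime.isAfter(now);
    }

    public static void main(String[] args) {
        // A modified time that was never set is 0, i.e. the epoch.
        System.out.println(formatUtc(0L)); // 1970-01-01T00:00:00Z

        // Fetch time of the dmoz.org entry, taken from the dump above.
        Instant fetchTime = Instant.parse("2007-02-22T12:45:43Z");
        System.out.println(intervalStart(fetchTime, Duration.ofDays(30))); // 2007-01-23T12:45:43Z

        // On 25 January 2007 the entry would not yet be due under this assumption.
        System.out.println(isDue(fetchTime, Instant.parse("2007-01-25T12:00:00Z"))); // false
    }
}
```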

 

Looking forward to your reply.

Regards,

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

 



RE: Modified date in crawldb

2007-01-25 Thread Armel T. Nene
Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
version and was able to apply the patch fully, but I was not entirely
successful in running it with the XML parser plugin. If you have applied it
successfully, let me know.

Regards,

Armel 

-Original Message-
From: chee wu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It seems
the modified date will be used by NUTCH-61; you can find details at the link
below:
 http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue; I just wrote a simple function to do this:
1. Retrieve all the date information contained in the page content, using a
   regular expression to identify the dates.
2. Choose the newest date found as the page's modified date.
3. Call the setModifiedTime() method of the CrawlDatum object in
   FetcherThread.output().
Maybe you can use a parse filter to separate this function from the core
code.
I am also new to Nutch, so if anything is wrong, please feel free to point it
out.
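
The three steps above can be sketched roughly as follows. This is not the original function: the date pattern (ISO yyyy-MM-dd only), the class name and the method names are assumptions made for illustration, and it uses java.time from current Java rather than anything available to Nutch 0.8. In Nutch, the resulting milliseconds value would be what gets passed to setModifiedTime().

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class NewestDateFinder {

    // Assumption: only ISO-style dates (yyyy-MM-dd) are recognised.
    private static final Pattern ISO_DATE = Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    /** Steps 1 and 2: find every date in the text and keep the newest one. */
    static Optional<LocalDate> newestDate(String text) {
        LocalDate newest = null;
        Matcher matcher = ISO_DATE.matcher(text);
        while (matcher.find()) {
            try {
                LocalDate candidate = LocalDate.parse(matcher.group());
                if (newest == null || candidate.isAfter(newest)) {
                    newest = candidate;
                }
            } catch (DateTimeParseException e) {
                // Matched the pattern but is not a real date (e.g. 2007-13-45); skip it.
            }
        }
        return Optional.ofNullable(newest);
    }

    /** Step 3 input: midnight UTC of the given date, in epoch milliseconds. */
    static long toEpochMillis(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        String page = "Created 2006-12-01, last updated 2007-01-20.";
        newestDate(page).ifPresent(d -> System.out.println(d + " -> " + toEpochMillis(d)));
    }
}
```

Picking the newest date on a page is only a heuristic: a page that mentions a future event would report that date as its modified time.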






Re: Modified date in crawldb

2007-01-25 Thread Andrzej Bialecki

Armel T. Nene wrote:

Most HTTP responses return a modified date in their headers, and every file on
a file system has a last-modified date. How come it's not stored in the
crawldb?

  


This is the issue described in NUTCH-61. Patches from that issue will be
applied to trunk/ soon.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Modified date in crawldb

2007-01-25 Thread chee wu
Armel,
   Sorry, I haven't tried this patch yet.
