Modified date in crawldb
Hi guys,

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually save the last modified date of files. I have run a crawl on my local file system and on the web. When I dumped the contents of the crawldb for both crawls, the modified date of every entry was set to 01-Jan-1970 01:00:00. I don't know if this is intended or if it's a bug. Therefore my questions are:

* How does the generator know which file to crawl again?
  o Is it looking at the fetch time?
  o Or at the modified date, which can be misleading?

Most HTTP responses carry a modified date in their headers, and every file on a file system has a last modified date. How come it's not stored in the crawldb?

Here is an extract from my 2 crawls:

http://dmoz.org/Arts/
Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:45:43 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.013471641
Signature: fe52a0bcb1071070689d0f661c168648
Metadata: null

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
Version: 4
Status: 2 (DB_fetched)
Fetch time: Sat Feb 24 10:31:44 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.1035091E-4
Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
Metadata: null

Looking forward to your reply.

Regards,
Armel

Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
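A modified time of Thu Jan 01 1970 is what a zero timestamp looks like when printed, which suggests the field was never set rather than set to a wrong date. The sketch below is illustrative Python, not Nutch code: `render_millis` and its `utc_offset_hours` parameter are hypothetical names, and the one-hour offset is an assumption chosen only to reproduce the 01:00:00 seen in the dump.

```python
from datetime import datetime, timedelta, timezone


def render_millis(millis, utc_offset_hours=0):
    """Format milliseconds since the Unix epoch roughly as a date dump would.

    utc_offset_hours is an assumed display offset, not something read from Nutch.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.fromtimestamp(millis / 1000, tz)
    return moment.strftime("%a %b %d %H:%M:%S %Y")


# A zero (unset) timestamp is the epoch itself.
print(render_millis(0))                      # Thu Jan 01 00:00:00 1970
# The same zero value shown one hour ahead of UTC matches the dump.
print(render_millis(0, utc_offset_hours=1))  # Thu Jan 01 01:00:00 1970
```

If that reading is right, both records carry the same placeholder value, and neither the HTTP header nor the file system date was ever copied into the entry.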
RE: Modified date in crawldb
Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that version and was able to apply the patch fully, but was not entirely successful in running it with the XML parser plugin. If you have applied it successfully, let me know.

Regards,
Armel

Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It seems the modified date will be used by NUTCH-61; you can find the details at the link below:

http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue, and just wrote a simple function to do this:

1. Retrieve all the date information contained in the page content, using a regular expression to identify dates.
2. Choose the newest date found as the page's modified date.
3. Call the setModifiedTime() method of the CrawlDatum object in FetcherThread.output().

Maybe you can use a parse filter to separate this function from the core code. I am also new to Nutch, so if anything is wrong, please feel free to point it out.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb
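The three steps chee wu describes can be sketched roughly as follows. This is illustrative Python rather than the actual function or any Nutch API: `DATE_PATTERN`, `newest_date_in` and `to_epoch_millis` are hypothetical names, and the pattern is assumed to cover only ISO-style dates such as 2007-01-25, whereas a real page would need many more formats.

```python
import re
from datetime import datetime

# Assumed pattern: ISO-style dates only (YYYY-MM-DD).
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def newest_date_in(text):
    """Return the latest valid date found in text, or None if there is none."""
    found = []
    for year, month, day in DATE_PATTERN.findall(text):
        try:
            found.append(datetime(int(year), int(month), int(day)))
        except ValueError:
            continue  # looks like a date (e.g. 2007-13-45) but is not one
    return max(found) if found else None


def to_epoch_millis(moment):
    """Convert a naive datetime, treated as UTC, to milliseconds since the epoch."""
    return int((moment - datetime(1970, 1, 1)).total_seconds() * 1000)


page = "Created 2006-12-01, last updated 2007-01-25."
newest = newest_date_in(page)
if newest is not None:
    # This is the kind of value step 3 would hand to setModifiedTime().
    print(to_epoch_millis(newest))
```

Picking the newest date in the content is a heuristic: a page that mentions a future event would get a modified date later than its real one, which is one reason to keep this logic outside the core code.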
Re: Modified date in crawldb
Armel T. Nene wrote:
> Hi guys, I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually save the last modified date of files. I have run a crawl on my local file system and on the web. When I dumped the contents of the crawldb for both crawls, the modified date of every entry was set to 01-Jan-1970 01:00:00. I don't know if this is intended or if it's a bug. Therefore my questions are:
>
> * How does the generator know which file to crawl again?
>   o Is it looking at the fetch time?
>   o Or at the modified date, which can be misleading?
>
> Most HTTP responses carry a modified date in their headers, and every file on a file system has a last modified date. How come it's not stored in the crawldb?

This is the issue described in NUTCH-61. Patches from that issue will be applied soon to trunk/.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
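On the question of fetch time: in the dump from the original post, both fetch times (22 and 24 February 2007) lie about 30 days after the thread date of 25 January 2007, which matches the 30-day retry interval. That suggests the field holds the time at which the entry next becomes due rather than the time of the last fetch. The sketch below works through that arithmetic; it is illustrative Python with hypothetical function names, and the rule "an entry is due once its fetch time has passed" is an assumption, not something confirmed in this thread.

```python
from datetime import datetime, timedelta


def next_fetch_time(fetched_at, retry_interval_days):
    """Assumed rule: the next fetch is due one retry interval after the last one."""
    return fetched_at + timedelta(days=retry_interval_days)


def is_due(fetch_time, now):
    """Assumed rule: an entry is selected once its fetch time is not in the future."""
    return fetch_time <= now


# Fetch time from the dump for http://dmoz.org/Arts/
due = datetime(2007, 2, 22, 12, 45, 43)

# Working backwards, a 30-day interval implies a fetch on 23 January 2007.
assert next_fetch_time(datetime(2007, 1, 23, 12, 45, 43), 30) == due

print(is_due(due, datetime(2007, 1, 25, 19, 52)))  # False: not yet due on the thread date
print(is_due(due, datetime(2007, 2, 23)))          # True: due once the interval has elapsed
```

Under these assumptions the modified time plays no part in the decision, so an unset value would not by itself stop pages from being recrawled.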
Re: Modified date in crawldb
Armel,

Sorry, I haven't tried this patch yet.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 11:07 PM
Subject: RE: Modified date in crawldb