RE: Document Classification - indexing question
Bastian,

When classifying documents with the dynamic-classification approach, Nutch can take a while to parse the data, depending on the file type. While working with Nutch I have encountered some NullPointerExceptions during parsing. This was due to a Hadoop setting that is not exposed in the nutch-default.xml file; the setting should let Nutch increase the time Hadoop waits before marking a process as inactive.

Some questions you should investigate: how will your classification process handle failed parses, and what happens if the data is not parsed into a text format (i.e. an unsupported file type)? What happens to the index being created if the classification fails - is it corrupted? In a multithreaded environment such as Nutch, what happens to concurrent classification processes - mixed-up data? I have a problem with Nutch now: it seems unable to generate dynamic fields based on documents when using more than a single thread. The index becomes corrupted, with data from different files mixed into the wrong Lucene document. Many other questions will come up once you start to work on your classification project.

Best regards

Armel

-----Original Message-----
From: Bastian Preindl [mailto:[EMAIL PROTECTED]
Sent: 08 May 2007 13:38
To: nutch-dev@lucene.apache.org
Subject: Re: Document Classification - indexing question

Hi Armel,

thanks for your quick reply!

> I have been working on a similar project for the last couple of months,
> but I am taking a slightly different approach, because fetching -
> parsing - indexing can be time-consuming and, in my case, I also need
> the unclassified indexes. Using a classification algorithm and the
> Lucene API, I build classified indexes by using the first index as a
> corpus.

This is definitely a good idea and a somewhat different approach, as it moves the classification task out of Nutch and into Lucene. Are there any frameworks/plugins already available for applying document classification within Lucene? The much faster parsing and indexing process within Nutch when no online classification takes place stands against the disk-space consumption, which is some thousand times greater when indexing all parsed documents instead of indexing only the positively classified ones.

> Maybe we should discuss together on Skype or MSN - let me know. My
> Skype is etapix.

That would be really nice, thanks for the offer! I'll let you know my MSN number after I've created an account.

Best regards

Bastian
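As background to the "first index as corpus" approach quoted above, a minimal sketch against the Lucene 1.9/2.0-era API: read each document from the unclassified index, classify its text, and copy the positives into a classified index. The classify() helper, the paths, and the stored "content" field are assumptions for illustration, not the actual project code:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class CorpusClassifier {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("crawl/index");      // unclassified corpus
        IndexWriter positive = new IndexWriter("classified/index",
            new StandardAnalyzer(), true);                         // true = create new index
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (reader.isDeleted(i)) continue;
          Document doc = reader.document(i);                       // note: returns stored fields only
          String text = doc.get("content");                        // assumes a stored "content" field
          if (text != null && classify(text)) {                    // classify() is hypothetical
            positive.addDocument(doc);                             // copy into the classified index
          }
        }
        positive.optimize();
        positive.close();
        reader.close();
      }

      // Hypothetical stand-in for whatever classification algorithm is used.
      static boolean classify(String text) { return text.contains("nutch"); }
    }

One design caveat worth noting: copying a Document read back from an index only carries its stored fields, which is one reason the thread weighs keeping the unclassified index around as the corpus.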
Nutch ERROR parse.OutlinkExtractor - getOutlinks
Hi guys,

I have recently been running successfully with most of the plug-ins enabled. Lately, I have been trying to index some XML files which contain strings of the form ftawi:xyz.

Nutch version 0.8.2-dev on MS Windows Server 2003.

During outlink extraction I get the following errors:

2007-04-17 21:52:51,598 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: ftawi
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at java.net.URL.<init>(Unknown Source)
        at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
        at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
        at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
        at org.apache.nutch.parse.stellent.StellentParser.getParse(StellentParser.java:53)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:283)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)

I get the same error with all the parser plug-ins when running over the same XML files. Can you let me know whether there is a way of using regular expressions to tell the application which kinds of URL should be accepted as outlinks? Also, Nutch should not crash if the URL in an outlink is not valid. Is there any other HTML parser in Nutch that I can try?

Awaiting your kind reply.

Regards,

Armel
===
Armel T. Nene
iDNA Solutions LTD
Tel: +44 (20) 7257 6124
Mobile: +44 (7886) 950 483
Web: http://www.idna-solutions.com
Blog: http://blog.idna-solutions.com
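On the regular-expression question: if the urlfilter-regex plugin is enabled, URL acceptance is controlled by conf/regex-urlfilter.txt, where each line is a + (accept) or - (reject) prefix followed by a Java regex, applied in order. A hedged example that would keep only the protocols this crawl uses (note Nutch file URLs have the single-slash form file:/C:/...):

    # reject the unknown scheme seen in the logs
    -^ftawi:
    # accept only http and local-file URLs
    +^http://
    +^file:/
    # reject everything else
    -.

One caveat, grounded in the stack trace above: the MalformedURLException is raised while the Outlink itself is being constructed, i.e. before URL filters run, so a filter governs what ends up in the crawldb but may not silence the ERROR log line.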
Nutch java.io.IOException
crawl.Injector - Injector: done
2007-04-05 16:35:34,439 INFO crawl.Generator - topN: 100
2007-04-05 16:35:34,439 DEBUG conf.Configuration - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:67)
        at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
        at org.apache.nutch.crawl.Generator.main(Generator.java:416)
        at com.idna.nutch.launcher.CrawlerManager.autoGenSegList(CrawlerManager.java:80)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:211)
2007-04-05 16:35:34,443 INFO conf.Configuration - parsing jar:file:/E:/iDna-nutch-RC1/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
2007-04-05 16:35:34,450 INFO conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-default.xml
2007-04-05 16:35:34,462 INFO conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/nutch-site.xml
2007-04-05 16:35:34,468 INFO conf.Configuration - parsing file:/E:/iDna-nutch-RC1/iDna-nutch-launcher/test/conf/hadoop-site.xml
2007-04-05 16:35:35,470 INFO crawl.Generator - Generator: starting
2007-04-05 16:35:35,470 INFO crawl.Generator - Generator: segment: test/segments/20070405163535
2007-04-05 16:35:35,470 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2007-04-05 16:35:35,471 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:86)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:97)
        at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:26)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:309)
        at org.apache.nutch.crawl.Generator.main(Generator.java:417)
        at com.idna.nutch.launcher.CrawlerManager.autoGenSegList(CrawlerManager.java:80)
        at com.idna.nutch.launcher.CrawlerManager.main(CrawlerManager.java:211)

===
Armel T. Nene
iDNA Solutions LTD
Tel: +44 (20) 7257 6124
Mobile: +44 (7886) 950 483
Web: http://www.idna-solutions.com
Blog: http://blog.idna-solutions.com
RE: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue
Dennis,

I was wondering whether this patch could fix my problem, which is, if not the same, very similar to this one. I am using Nutch 0.8.2-dev; I made a checkout from SVN a while ago but never updated again. I was able to crawl 1 XML file before with no errors whatsoever. These are the errors I get when fetching:

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

One other problem is that my Hadoop jar is named hadoop-0.4.0-patched. I don't know whether that means I am running the 0.4.0 version; it seems a little confusing. Once you clarify that for me, I will be able to apply the patch to my version.

Best Regards,

Armel

-----Original Message-----
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: 13 February 2007 21:09
To: nutch-dev@lucene.apache.org
Subject: Re: NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

Actually I take it back. I don't think it is the same problem, but I do think it is the right solution.

Dennis Kubes

Dennis Kubes wrote:
> This has to do with HADOOP-964. Replace the jar files in your Nutch
> version with the most recent versions from Hadoop. You will also need
> to apply the NUTCH-437 patch to get Nutch to work with the most recent
> changes to the Hadoop codebase.
>
> Dennis Kubes

Gal Nitzan wrote:
> Hi,
>
> Does anybody use the Nutch trunk? I am running Nutch 0.9 and am unable
> to fetch: after 50-60K urls I get an NPE in
> org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue every time. I was
> wondering if anyone has a workaround, or maybe something is wrong with
> my setup. I have opened a new issue in JIRA for this:
> http://issues.apache.org/jira/browse/hadoop-1008
>
> Any clue?
>
> Gal
Nutch error messages
Hi guys,

I wrote a parser for parsing proprietary file formats. The plugin used to work until recently. Now, when I try to parse simple CAD files, I get the following error messages:

INFO fetcher.Fetcher - fetching file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/(00)E9~161394764(1).PDF
WARN fetcher.Fetcher - Error parsing: file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/(00)E9~161394764(1).PDF: failed(2,200): java.lang.NullPointerException

There are some debug lines in the parser, but they never get logged to the log file. Also, when I set the log level to DEBUG, I see the following messages:

INFO fetcher.Fetcher - fetching file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg
DEBUG file.File - fetching file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg
DEBUG parse.ParserFactory - Could not clean the content-type [], Reason is [org.apache.nutch.util.mime.MimeTypeException: The type can not be null or empty]. Using its raw version...
DEBUG parse.ParserFactory - ParserFactory: No parse plugins mapped or enabled for contentType
DEBUG parse.ParseUtil - Parsing [file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg] with [EMAIL PROTECTED]
WARN fetcher.Fetcher - Error parsing: file:/H:/businessDNA/External/BDP/P1109173/M_Drive/QTRAK/Attachments/15186-1A(1).dwg: failed(2,200): java.lang.NullPointerException

If anybody can make sense of these errors, please guide me. Also, I have disabled most of Nutch's parsers in favour of my custom one, as it can parse many formats. I am awaiting any help from the community.

Regards,

Armel
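Reading the DEBUG lines above, no content type is being resolved for the .dwg files, so no parser gets mapped at all. If that diagnosis is right, one avenue is to register the extension in conf/mime-types.xml and route the resulting type to the custom plugin in conf/parse-plugins.xml. A hedged sketch - the mime type "application/acad", the plugin id "parse-custom", and the exact mime-types.xml schema (check the shipped conf file) are all assumptions:

    <!-- conf/mime-types.xml: associate the .dwg extension with a type -->
    <mime-type name="application/acad" description="AutoCAD drawing">
      <ext>dwg</ext>
    </mime-type>

    <!-- conf/parse-plugins.xml: route that type to the custom parser -->
    <mimeType name="application/acad">
      <plugin id="parse-custom" />
    </mimeType>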
RE: Fetcher2
Kauu,

The url for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]
Sent: 25 January 2007 09:31
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

please give us the url, thx

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:
> Just appended the portion for .81 to NUTCH-339.
>
> ----- Original Message -----
> From: Armel T. Nene [EMAIL PROTECTED]
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, January 25, 2007 8:06 AM
> Subject: RE: Fetcher2
>
> Chee,
>
> Can you make the code available through Jira?
>
> Thanks,
> Armel
>
> -----Original Message-----
> From: chee wu [mailto:[EMAIL PROTECTED]
> Sent: 24 January 2007 03:59
> To: nutch-dev@lucene.apache.org
> Subject: Re: Fetcher2
>
> Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty
> easy. I can share the code if anyone wants to use it.
>
> ----- Original Message -----
> From: Andrzej Bialecki [EMAIL PROTECTED]
> To: nutch-dev@lucene.apache.org
> Sent: Tuesday, January 23, 2007 12:09 AM
> Subject: Re: Fetcher2
>
> chee wu wrote:
> > Fetcher2 should be a great help for me, but it seems it can't be
> > integrated with Nutch 0.8.1. Any advice on how to use it based on .81?
>
> You would have to port it to Nutch 0.8.1 - e.g. change all Text
> occurrences to UTF8, and most likely make other changes too...
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web, Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
www.babatu.com
Modified date in crawldb
Hi guys,

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually save the last-modified date of files. I have run a crawl on my local file system and on the web. When I dumped the content of the crawldb for both crawls, the modified date of the files was set to 01-Jan-1970 01:00:00. I don't know whether it is intended to be this way or whether it is a bug. Therefore my question is:

* How does the generator know which file to crawl again?
  o Is it looking at the fetch time?
  o Or the modified date, as this can be misleading?

There is a modified date returned in most http headers, and files on a file system all carry a last-modified date. How come it is not stored in the crawldb? Here is an extract from my two crawls:

http://dmoz.org/Arts/
Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:45:43 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.013471641
Signature: fe52a0bcb1071070689d0f661c168648
Metadata: null

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
Version: 4
Status: 2 (DB_fetched)
Fetch time: Sat Feb 24 10:31:44 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.1035091E-4
Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
Metadata: null

Looking forward to your reply.

Regards,

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
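For readers wanting to reproduce a dump like the one above, the CrawlDbReader tool exposes it from the command line in Nutch 0.8 (paths are illustrative):

    bin/nutch readdb crawl/crawldb -stats
    bin/nutch readdb crawl/crawldb -dump crawldb_dump
    bin/nutch readdb crawl/crawldb -url http://dmoz.org/Arts/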
RE: Modified date in crawldb
Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that version and was able to apply it fully, but I was not entirely successful in running it with the XML parser plugin. If you have applied it successfully, let me know.

Regards,

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It seems the modified date will be used by NUTCH-61; you can find details at the link below:
http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue, and just wrote a simple function to fulfil this:

1. Retrieve all the date information contained in the page content; a regular expression is used to identify the date information.
2. Choose the newest date found as the page's modified date.
3. Call setModifiedTime() on the CrawlDatum object in FetcherThread.output().

Maybe you can use a parse filter to separate this function from the core code. I am also new to Nutch; if anything is wrong, please feel free to point it out.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb

> Hi guys, I am using Nutch 0.8.2-dev. I have noticed that the crawldb
> does not actually save the last-modified date of files. [...]
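As a rough illustration of chee wu's three steps, a hedged sketch - the date pattern is deliberately simplistic, the class and method names are made up, and the exact spot in FetcherThread.output() where the CrawlDatum is available varies by version:

    import java.text.SimpleDateFormat;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.nutch.crawl.CrawlDatum;

    public class ContentDateGuesser {
      // Step 1: an (assumed) pattern for dates like 2007-01-25.
      private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

      public static void stamp(String pageText, CrawlDatum datum) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        long newest = 0L;
        Matcher m = DATE.matcher(pageText);
        while (m.find()) {                       // Step 2: keep the newest date found.
          long t = fmt.parse(m.group()).getTime();
          if (t > newest) newest = t;
        }
        if (newest > 0) {
          datum.setModifiedTime(newest);         // Step 3: record it on the CrawlDatum.
        }
      }
    }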
Thread-safe methods in Nutch
Hi guys,

I know it's me again. I have been testing Nutch robustly lately, and here are some threading issues that I found. I am running version 0.8.2-dev.

When Nutch is initially run (either from the script or from ANT), it defaults to 10 threads for the fetcher. This is good for performance, as a large number of urls can be indexed quickly. The problem is that some plugins are not thread-safe (or is it the fetcher that's not thread-safe?). I am running the parse-xml plugin (NUTCH-185) and see some issues: when running multiple threads, such as the default 10, I get inconsistencies in the stored fields and values. I found that the first 6 documents are indexed without problems, then 4 with errors, then 4 correct, then x number with errors, and so forth. At first I couldn't see where the problem was; after several debugging sessions I realised it could be a threading issue. I ran Nutch with the minimum of 1 thread and the fields were stored without any issues. I don't know how to conclude this, but I think the methods that Nutch calls from its threads are not thread-safe. I could be wrong, therefore I am awaiting any reply.

Regards,

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
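For what it's worth, a common cause of exactly this symptom is a parser plugin keeping per-document state in instance fields, since one plugin instance can be called from every fetcher thread at once. A hedged sketch of the anti-pattern and the fix - the class and helper below are made up, not the actual parse-xml code:

    // Unsafe: one shared instance means shared mutable state across threads.
    public class XmlFieldParser {
      private String currentTitle;                 // shared field - races happen here

      public String parseUnsafe(String xml) {
        currentTitle = extractTitle(xml);          // thread B can overwrite thread A's value
        return currentTitle;                       // may return another document's title
      }

      // Safe: per-document state lives in locals, one copy per thread's stack.
      public String parseSafe(String xml) {
        String title = extractTitle(xml);
        return title;
      }

      private String extractTitle(String xml) {    // hypothetical helper
        int s = xml.indexOf("<title>"), e = xml.indexOf("</title>");
        return (s >= 0 && e > s) ? xml.substring(s + 7, e) : "";
      }
    }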
How to modify crawldb values
Hi guys,

I want to extend Nutch to do real-time indexing on a local file system. I have been through the source code to find ways to modify values stored in the crawldb. The idea is simple: I have an external program (or a script) which checks for changes in a directory (whose url is injected into the crawldb). When new changes are recorded, the program will update the status in the crawldb and generate a new fetch list for the fetcher. I do not want to make big changes to the Nutch source code, as I want the program to stay compatible with future releases.

Now, I know the CrawlDatum is saved in the crawldb with the url. I am not too sure, but I think the url is the key used to retrieve the CrawlDatum. For my program to work successfully, I need to know the following:

* How to read data from the crawldb: what data structure does it use, and how is it referenced?
* How to write back to the crawldb: updating information in place, or probably creating a new one with the changed and unchanged values.

This is an extract from the crawldb:

http://some-url.com/
Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:44:05 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0323955
Signature: f4c14c46074b66aad8829b8aa84cd636
Metadata: null

How can an external program get this information and modify/update it? Once I know how to implement that part, I can call Nutch in the usual generate - fetch - updatedb - updatelinkdb - index way, so generate will see the new values that I want re-indexed. This will stop the fetcher from fetching a long list of urls (changed or unchanged, but due for fetching because their next_fetch_time has arrived). The program gets its updates from the underlying OS, which notifies it about any changes to the files and folders being monitored. Once the program is working and sufficiently tested, I will be willing to share the source code; it is written in Java and doesn't need any script to launch Nutch.

I will be looking forward to your kind support.

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
RE: How to modify crawldb values
Thanks for the reply. I'll try this, and if I encounter any problems I'll send another email. This would be a good feature to have, and it would probably keep the project from branching into different subprojects.

Regards,

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: Doğacan Güney [mailto:[EMAIL PROTECTED]
Sent: 23 January 2007 15:06
To: nutch-dev@lucene.apache.org
Subject: Re: How to modify crawldb values

Hi,

Armel T. Nene wrote:
> [...]
> * How to read data from the crawldb: what data structure does it use,
>   and how is it referenced?

The crawldb is essentially a list of <url, CrawlDatum> pairs and is stored as a MapFile. So you can read it with MapFile.Reader.get.

> * How to write back to the crawldb: updating information in place, or
>   probably creating a new one with the changed and unchanged values.

The current FS implementation is write-once, so you can't modify it. But you can read it one by one (possibly with MapFile.Reader.next) and then write a new one with MapFile.Writer.
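Putting Doğacan's two answers together, a hedged sketch of the read-then-rewrite loop with Nutch 0.8-era types (UTF8 keys, CrawlDatum values). The part-file name, the paths, the MapFile constructor argument order (which shifted between early Hadoop releases), and the shouldRefetch() hook are all assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.UTF8;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.util.NutchConfiguration;

    public class CrawlDbRewriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        String in = "crawl/crawldb/current/part-00000";        // assumed on-disk layout
        String out = "crawl/crawldb-new/current/part-00000";

        MapFile.Reader reader = new MapFile.Reader(fs, in, conf);
        MapFile.Writer writer = new MapFile.Writer(conf, fs, out, UTF8.class, CrawlDatum.class);

        UTF8 url = new UTF8();
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) {                      // copy every entry...
          if (shouldRefetch(url.toString())) {                 // ...adjusting the changed ones
            datum.setFetchTime(System.currentTimeMillis());    // make it due for the next generate
          }
          writer.append(url, datum);                           // MapFile requires sorted keys; reading in order preserves this
        }
        writer.close();
        reader.close();
      }

      // Hypothetical hook for the OS change-notification component described above.
      static boolean shouldRefetch(String url) { return false; }
    }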
Is the crawldb format in Nutch 0.8 compatible with Nutch 0.7?
Hi guys,

I am running into some nightmares when trying to iterate over the values in a Nutch 0.8.2 crawldb. I am getting a Hadoop exception such as the following:

07/01/23 18:33:56 INFO conf.Configuration: parsing jar:file:/C:/nutch-0.8.2-dev/lib/hadoop-0.4.0-patched.jar!/hadoop-default.xml
07/01/23 18:33:56 INFO conf.Configuration: parsing jar:file:/C:/nutch-0.8.2-dev/nutch-0.8.2-dev.jar!/nutch-default.xml
07/01/23 18:33:56 INFO conf.Configuration: parsing jar:file:/C:/nutch-0.8.2-dev/nutch-0.8.2-dev.jar!/nutch-site.xml
Exception in thread "main" java.lang.ArithmeticException: / by zero
        at org.apache.hadoop.mapred.lib.HashPartitioner.getPartition(HashPartitioner.java:33)
        at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:88)
        at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:321)

Therefore, if I could iterate over the values contained in the crawldb using the Nutch 0.7 API, I think this would fix the issue. So the question is: is Nutch 0.8 backward compatible with Nutch 0.7.2?

Thanks,

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
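One plausible reading of that trace, assuming Hadoop 0.4's HashPartitioner matched the shape it has in later versions: getPartition takes the key's hash modulo the number of partitions, so it divides by zero when MapFileOutputFormat.getEntry finds zero part files - for instance, when the path handed to CrawlDbReader.get does not point at the directory containing part-00000 and friends:

    // org.apache.hadoop.mapred.lib.HashPartitioner (shape from later Hadoop versions):
    public int getPartition(K key, V value, int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  // "/ by zero" when 0 parts
    }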
java.lang.IllegalStateException
Hi guys,

I am using Nutch 0.8.1, and for the past 2 days I have been getting the following exception: java.lang.IllegalStateException. The exceptions started after I implemented the NUTCH-61 patch (adaptive re-crawl interval). In short, this is what happens: I am trying to crawl XML files (locally, and remotely on a web server); once crawled, the fetcher sends the files to their parsers. This is where the exception is thrown, as the parsers launch but do not perform any activity on the file. If anybody has dealt with this type of error, please let me know how to get rid of it. Below is an extract from my log file:

2007-01-18 14:16:16,371 INFO parse.xml - XMLParser config path : ..
2007-01-18 14:16:16,371 INFO parse.xml - XMLParser config path : ..
2007-01-18 14:16:16,371 WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200): java.lang.IllegalStateException: Root element not set
2007-01-18 14:16:16,371 WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_11.xml: failed(2,200): java.lang.IllegalStateException: Root element not set
2007-01-18 14:16:16,387 INFO parse.xml - XMLParser config path : ..
2007-01-18 14:16:16,403 WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_13.xml: failed(2,200): java.lang.IllegalStateException: Root element not set
2007-01-18 14:16:16,403 INFO parse.xml - XMLParser config path : ..
2007-01-18 14:16:16,403 WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_14.xml: failed(2,200): java.lang.IllegalStateException: Root element not set
2007-01-18 14:16:16,418 INFO parse.xml - XMLParser config path : ..
2007-01-18 14:16:16,418 WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_10.xml: failed(2,200): java.lang.IllegalStateException: Root element not set
2007-01-18 14:16:17,887 INFO fetcher.Fetcher - Fetcher: done

If a root element were missing within an XML file, I would expect a NullPointerException to be thrown, not an IllegalStateException. Can anyone shed some light on this error? Thanks.

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
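For what it's worth, "Root element not set" is the exact message JDOM's Document throws when its root element is read before one has been attached - which would point at the document never being built at all rather than at the file's contents. A hedged guess, but easy to reproduce:

    import org.jdom.Document;

    public class JdomRootDemo {
      public static void main(String[] args) {
        Document doc = new Document();   // no root element attached yet
        doc.getRootElement();            // java.lang.IllegalStateException: Root element not set
      }
    }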
RE: Next Nutch release
Hi guys,

I have been working on NUTCH-61 (Adaptive re-fetch interval. Detecting unmodified content), applying it to Nutch 0.8.1. Here are some points:

1. This feature is great for Nutch to have, as it differentiates between modified and unmodified content, and therefore avoids indexing twice even if a document's fetch time has arrived.
   a. There are some performance issues here. Even with this patch, Nutch still fetches the content and then checks its status against the last-modified time in the database. If it has to check 1000 files before indexing the following 10 files, this is a real problem for those who are after real-time indexing.
2. Since applying this patch to Nutch 0.8.1, when I try to parse XML files with our modified version of the XML parser/indexer plugin, the fetcher throws the following exception:

   WARN fetcher.Fetcher - Error parsing: file:/C:/880254/8802_583254_20051006_12.xml: failed(2,200): java.lang.IllegalStateException: Root element not set

   The system does not hang or crash, but the XML file is indexed without any generated fields. The plugin works fine without the patch. I have another parser, for graphics and other formats, that also fails when used with the patch. So far this problem occurs only when using the file protocol.
3. The patch works fine when indexing web sites using the http protocol.

I am willing to work with Andrzej to make it stable, as I understand he is the architect of this patch. I have the possibility of testing it in a mixed environment in our computer lab. This patch can be the stepping stone for other features, such as real-time indexing and a fetch queue for index updating, as opposed to creating a new index each time.

Best Regards,

Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: 17 January 2007 15:39
To: nutch-dev@lucene.apache.org
Subject: Re: Next Nutch release

Sami Siren wrote:
> 2007/1/17, Enis Soztutar [EMAIL PROTECTED]:
>> Hi all, for NUTCH-251: I suppose that NUTCH-251 is a relatively
>> significant issue, by the votes. Stefan has written a good plugin for
>> the admin gui and I have updated it to work with nutch-0.8, hadoop 0.4.
>
> Good to hear someone is working on that! Why not target it to the trunk
> version of Nutch?

It is targetted to the trunk already. The previous version was targetted to nutch-0.8, hadoop 0.4, since back then those versions were the latest in the trunk.

>> - a web server to serve plugin jsp's
>
> Why not make it a regular war? Also, please consider making a clean
> separation of view/logic when you implement the web ui.

As Stefan's version used an embedded Jetty server, I continued this way. But I will consider that possibility also.

--
Sami Siren
protocol-smb: a new protocol plugin for Windows Shares
Hi guys,

We've developed a plugin: http://issues.apache.org/jira/browse/NUTCH-427

This plugin allows you to crawl MS Windows shares. It uses a property file to read user credentials. We'd appreciate community feedback on the issue, and possible inclusion in future versions.

Best regards,

Armel T. Nene
Nutch site crawling
Hi,

Is it possible to let Nutch crawl a set of documents at a time? I have set up Nutch with the following options: topN 20, depth 2. I wanted Nutch to crawl my directory only as deep as 2 links from the root directory. Now, the root directory itself contains more than 20 files, but my understanding of topN is that it makes the crawler fetch 20 documents and then index them; at the next crawl, it chooses another 20 files from the directory and fetches and indexes those. My problem is that when Nutch crawls, it keeps fetching the same files over and over again. That is a severe issue in my case, because I have to run Nutch on directories with more than 100 GB of data. It is more efficient to crawl and index a small set of files at a time than to try to fetch all the data before indexing. Can you suggest a workaround, or just let me know what I am doing wrong? Thanks in advance.

Regards,

Armel
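For reference, the settings described map onto the Nutch 0.8 one-shot crawl command like this (the directory names are illustrative):

    bin/nutch crawl urls -dir crawl -depth 2 -topN 20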
Nutch re-crawls the same files over and over again
Hi,

I have set up Nutch to crawl my local filesystem, with topN 20 and depth 2. But when Nutch re-crawls, it re-crawls the same files over and over again. The directory doesn't contain any sub-directories; can someone tell me what might be the cause? There are more than 20 files in the directory, so why is Nutch only getting the same twenty files?

Thanks,

Armel

-----Original Message-----
From: Michael Stack [mailto:[EMAIL PROTECTED]
Sent: 06 December 2006 16:04
To: Shay Lawless
Cc: nutch-user@lucene.apache.org; nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: [Archive-access-discuss] Full List of Metadata Fields

Hey Shay. Some friendly advice: cross-posting a question will make you unpopular fast. It's best to start on the most appropriate-seeming list and only move on from there if you are getting no satisfaction. The question below looks most at home over on the archive-access list. Let me have a go at answering it there.

Yours,
St.Ack

Shay Lawless wrote:
> Hi all,
>
> I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version
> 0.5.0-200611082313) to index a collection of ARC files generated by a
> web crawl using the Heritrix web crawler (Version 1.4.0).
>
> When I check the metadata tag on the Wera front-end, the following tags
> are displayed: ARC Identifier, URL, Time of Archival, Last Modified
> Time, Mime-Type, File Status, Content Checksum, HTTP Header.
>
> When I click on the explain link in the NutchWax front-end, the
> following tags are displayed: Segment, Digest, Date, ARCDate, Encoding,
> Collection, ARCName, ARCOffset, ContentLength, PrimaryType, subType,
> URL, Title, Boost.
>
> Is there a full list of the metadata fields that NutchWax/Nutch creates
> when indexing? I'm particularly interested in tags relating to the
> actual content of each page, i.e. content type, description, etc. When
> searching, does NutchWax/Nutch search across such tags, or just across
> the parsed text of each page for occurrences of keywords?
>
> Any help you can provide would be greatly appreciated!
>
> Shay
RE: Indexing and Re-crawling site
Lukas,

I was wondering about running Nutch as a Windows service, and I was able to implement it as follows:

1. Create a Java program that acts as a Nutch launcher and re-crawler.
2. Download JavaService from http://javaservice.objectweb.org/
3. Follow the tutorial to turn your Java program into a Windows service.

I then tested it on Windows Server 2003 and XP; it works fine. If you want me to post the code, let me know - maybe others can use it too.

Regards,

Armel

-----Original Message-----
From: Lukas Vlcek [mailto:[EMAIL PROTECTED]
Sent: 04 December 2006 22:12
To: nutch-dev@lucene.apache.org
Subject: Re: Indexing and Re-crawling site

Hi,

I will try to use my out-dated knowledge to answer (& confuse you on) your items:

> 1. Why does Nutch have to create a new index every time when indexing,
> while it could just merge with the old existing index? [...]

This is more for Nutch experts, but to me it seems that a new index is reasonable. Besides other things, it means that the original index is still searchable while the new index is being created (creating a new index can take a long time, based on your settings). Updating one document at a time in a large index is not a very optimal approach, I think.

> 2. What is the best way to make Nutch re-crawl? [...] I know most
> re-crawls are written as batch scripts, but can you tell me which one
> you recommend: a batch script or a loop-based Java program?

I used to use batch and was happy with it.

> 3. What is the best way of implementing Nutch as a Windows service or
> Unix daemon?

Sorry - what do you mean by this?

Regards,
Lukas
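JavaService registers an ordinary class with a main method as the service entry point, so step 1 boils down to a loop around a crawl. A hedged sketch using the Nutch 0.8 one-shot crawl entry point - the interval, directory names and arguments are illustrative, and a real launcher would need to rotate the -dir (or drive generate/fetch/updatedb/index itself), since the crawl tool refuses an existing directory:

    public class NutchServiceLauncher {
      public static void main(String[] args) throws Exception {
        long intervalMs = 60L * 60L * 1000L;   // hypothetical: re-crawl every hour
        while (true) {
          // Same arguments as "bin/nutch crawl urls -dir crawl -depth 2 -topN 20"
          org.apache.nutch.crawl.Crawl.main(
              new String[] { "urls", "-dir", "crawl-" + System.currentTimeMillis(),
                             "-depth", "2", "-topN", "20" });
          Thread.sleep(intervalMs);
        }
      }
    }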
Indexing and Re-crawling site
Hi guys,

I have a few questions regarding the way Nutch indexes and the best way a re-crawl can be implemented.

1. Why does Nutch have to create a new index every time when indexing, while it could just merge with the old existing index? I tried changing the value in the IndexMerger class to 'false' when creating an index, so that Lucene doesn't recreate a new index each time it is indexing. The problem with this is that I keep getting exceptions when it tries to merge the indexes: a lock timeout exception is thrown by the IndexMerger, and consequently the index does not get created properly. Is it possible to let Nutch index by merging with an existing index? I have to crawl about 100 GB of data, and if only a few documents have changed, I don't want Nutch to recreate a new index because of that, but to update the existing index by merging the new one into it. I need some light on this.

2. What is the best way to make Nutch re-crawl? I have implemented a class that loops the crawl process; it has a crawl interval, which is set in a property file, and a running status. The running status is a boolean variable which is set to true if the re-crawl process is ongoing, or false if it should stop. But with this approach it seems the index is not fully generated: the values in the index cannot be queried. The re-crawl is in Java, which calls an underlying Ant script to run Nutch. I know most re-crawls are written as batch scripts, but can you tell me which one you recommend: a batch script or a loop-based Java program?

3. What is the best way of implementing Nutch as a Windows service or Unix daemon?

Thanks,

Armel
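On question 1, the merge-into-existing-index idea corresponds to opening the Lucene IndexWriter with create=false and calling addIndexes(). A hedged sketch against the Lucene 1.9/2.0-era API (paths are illustrative; the write lock this acquires is exactly what produces the lock-timeout failure described above if another writer or the searcher's reaper holds it):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIntoExisting {
      public static void main(String[] args) throws Exception {
        // false = append to the existing index instead of recreating it
        IndexWriter writer = new IndexWriter("crawl/index",
            new StandardAnalyzer(), false);
        Directory[] newIndexes = {
            FSDirectory.getDirectory("crawl/indexes/part-00000", false) };
        writer.addIndexes(newIndexes);   // merge the freshly built segment index in
        writer.optimize();
        writer.close();                  // releases the write lock
      }
    }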
RE: [jira] Created: (NUTCH-408) Plugin development documentation
I agree with you that documentation is vital, not just for extending the current version but also for any plugins and patches created. I have been spending almost two weeks trying to adapt Nutch to my project, but I spend more time reading code and trying to understand what it does before I can even start to fix problems. Come on guys, documentation is good coding practice; we can't read your minds to know exactly what you were trying to achieve by just looking at the implementation code. This is just good constructive criticism. :)

Armel

-----Original Message-----
From: nutch.newbie (JIRA) [mailto:[EMAIL PROTECTED]
Sent: 25 November 2006 03:45
To: nutch-dev@lucene.apache.org
Subject: [jira] Created: (NUTCH-408) Plugin development documentation

Plugin development documentation

Key: NUTCH-408
URL: http://issues.apache.org/jira/browse/NUTCH-408
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.8.1
Environment: Linux Fedora
Reporter: nutch.newbie

Documentation is rare, but very vital for extending current (0.9) Nutch. The current docs on the wiki for 0.7 plugin development were good, but they don't apply to 0.9, and new developers who are joining directly at 0.9 find the 0.7 documentation insufficient. A more practical plugin-writing document for 0.9 is desired, also explaining the plugin principles in practical terms, i.e. extension points and libs etc. Furthermore, it would be good to provide some best-practice examples, i.e. look for the lib you are planning to use in the lib folder, in case a suitable version of the external lib is already there, rather than using another version - things like that.
RE: Nutch folder configuration
Also, can Nutch be run as a Windows service? Let me know, so that I don't waste my time trying to code something that won't work.

-----Original Message-----
From: Armel T. Nene [mailto:[EMAIL PROTECTED]
Sent: 21 November 2006 21:56
To: nutch-dev@lucene.apache.org
Subject: Nutch folder configuration

Hi all,

I want to configure Nutch so that I can have the various folders - conf, crawldb and index - stored on different drives. So far, it keeps giving me the following error:

ERROR mapred.JobClient: Input directory C:/omittted/omitted/testcrawl/urls in local is invalid.

Is Nutch always looking for folders in its current directory? I am also writing a Java client to launch Nutch without the script, so that it can be wrapped as a Windows service. I am having problems with Nutch's classpath - can you wise me up on that issue too? But first, how can I let Nutch know that the folders are stored in a different location? The settings for the folders are loaded from a property file, and the values are passed to the Generator, Injector, Fetcher and Indexer, but it still has problems. I am looking forward to a good tip on this.

Armel
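On the classpath question, a hedged sketch of what the bin/nutch script effectively does, translated to a Windows command line. The conf directory must come first on the classpath so nutch-site.xml is found, followed by the Nutch jar and every jar under lib - and each lib jar has to be listed individually, since the Java 5 runtimes of that era have no classpath wildcards. Paths, the jar names, and the argument values are illustrative:

    java -classpath "C:\nutch\conf;C:\nutch\nutch-0.8.1.jar;C:\nutch\lib\hadoop-0.4.0-patched.jar;..." ^
         org.apache.nutch.crawl.Crawl D:\testcrawl\urls -dir E:\crawl -depth 2

Tool arguments are plain paths, so pointing the url directory, -dir and the rest at different drives is a matter of passing absolute paths.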
RE: [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.
Rida,

There is something I would like to clarify: when using a namespace and XPath to store content in the index, can this be seen as multi-fields? For example, if we are storing a customer name and a customer address which have been declared in an XML configuration file, is that multi-field? Please explain - sorry, I am quite new to the Nutch architecture.

Armel

-----Original Message-----
From: Rida Benjelloun (JIRA) [mailto:[EMAIL PROTECTED]
Sent: 20 November 2006 22:16
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-185) XMLParser is configurable xml parser plugin.

[ http://issues.apache.org/jira/browse/NUTCH-185?page=comments#action_12451452 ]

Rida Benjelloun commented on NUTCH-185:
---------------------------------------

Nutch doesn't support multi-valued fields, so I decided to merge the content into the same field. If you want to search the field, you should index it as Text instead of Keyword.

> XMLParser is configurable xml parser plugin.
>
> Key: NUTCH-185
> URL: http://issues.apache.org/jira/browse/NUTCH-185
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, indexer
> Affects Versions: 0.7.2, 0.8.1, 0.8
> Environment: OS Independent
> Reporter: Rida Benjelloun
> Attachments: parse-xml.patch, parse-xml.zip, parse-xml.zip
>
> The XML parser is a configurable plugin. It uses XPath and namespaces to do
> the mapping between XML elements and Lucene fields.
>
> Information:
>
> 1. Copy xmlparser-conf.xml to the nutch/conf dir.
>
> 2. To index your custom XML file, you have to modify xmlparser-conf.xml.
>    This parser uses namespaces and XPath to parse XML content. The config
>    file does the mapping between the XML nodes (using XPath) and Lucene
>    fields. Example:
>
>    <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
>
> 3. The xmlIndexerProperties element encapsulates a set of fields associated
>    with a namespace. If the namespace is found in the XML document, the
>    fields represented by the namespace will be indexed. Example:
>
>    <xmlIndexerProperties type="filePerDocument"
>                          namespace="http://purl.org/dc/elements/1.1/">
>      <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
>      <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0"/>
>    </xmlIndexerProperties>
>
> 4. It is possible to define a default namespace that will be applied when
>    the parser doesn't find any namespace in the document, or when the
>    namespace found in the XML document doesn't match the namespace defined
>    in xmlIndexerProperties. Example:
>
>    <xmlIndexerProperties type="filePerDocument" namespace="default">
>      <field name="xmlcontent" xpath="//*" type="Unstored" boost="1.0"/>
>    </xmlIndexerProperties>
RE: What's the status of Nutch-GUI?
Chris, Rida,

Here are the changes that I have made to XMLParseConfig.java, in the populateConfig(Document doc) method:

    if (elemNode.getAttribute("nodeXpath") != null) {
      String nodeXpath = elemNode.getAttributeValue("namespace");
      xip.setNodeXpath(nodeXpath);
    }
    List fieldList = XPath.selectNodes(elemNode, "field");
    if (fieldList != null) { // modified 20062011 by Armel
      for (int j = 0; j < fieldList.size(); j++) {
        Element elem = (Element) fieldList.get(j);
        XMLField xf = populateXMLField(elem);
        fieldsColl.add(xf);
      }
    }
    /*
     * modified by Armel
     * 20062011
     * if fieldList is empty because it doesn't contain
     * an element field
     */
    if (fieldList == null) {
      XMLField xf = populateXMLField(elemNode);
      fieldsColl.add(xf);
    }

And the populateXMLField(Element el) method:

    if (elem.getAttribute("name") != null)
      xf.setFieldName(elem.getAttributeValue("name"));
    if (elem.getAttribute("name") == null) { // modified by Armel
      List att = elem.getAttributes();
      if (att != null) {
        // modified by Armel - loop and create fields accordingly
        for (int i = 0; i < att.size(); i++) {
          Attribute at = (Attribute) att.get(i);
          xf.setFieldName(elem.getAttributeValue(at.getName()));
        }
      }
    }
    if (elem.getAttribute("xpath") != null)
      xf.setFieldXPath(elem.getAttributeValue("xpath"));

This is supposed to implement the feature I want; please advise.

Armel

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]
Sent: 20 November 2006 23:30
To: nutch-dev@lucene.apache.org
Subject: Re: What's the status of Nutch-GUI?

Hi Armel,

On 11/20/06 1:44 PM, Armel T. Nene [EMAIL PROTECTED] wrote:
> Hi Chris,
>
> I am trying to extend parse-xml to enable the creation of Lucene fields
> straight from an XML file. For example, a database table that has been
> parsed as an XML file should be stored in the index with the relevant
> fields, i.e. customer name, address and so on. This file will not have
> a namespace associated with it, and should not be stored as xmlcontent
> in the database. Currently, parse-xml looks for known fields in the
> document and stores the associated values with the field names. I have
> added an extra condition: if the known fields are not present in the
> current document, the elements or nodes in the document should become
> the new fields stored in the index, with their values.

I think that this is fine.

> Therefore, when parse-xml receives an XML document with no namespace
> available, it will parse the document and store each element name as a
> new field in the index, with the element's associated value. Let me
> know if I am on the right track, because I think I don't have to write
> a separate plugin for this feature - just extend (or modify) parse-xml.

I think that parse-xml will support what you are talking about. In terms of the check that you are doing to see if a field exists or not before adding another value for it in the index: as I understood Lucene, I believe that you could just omit this check and add the field regardless. If you add multiple values for the same field in a Document, e.g.

    Document doc = new Document();
    doc.add(new Field("fieldname", "fieldvalue", ...));
    doc.add(new Field("fieldname", "fieldvalue2", ...));

both of the values "fieldvalue" and "fieldvalue2" will get stored in the index for the key "fieldname". So, if I understand you correctly (which I may not ;) ), then I think you can omit the check that you are talking about above and just go with adding the same field name 2x.
HTH,
Chris

> Cheers,
> Armel
>
> -----Original Message-----
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: 20 November 2006 18:40
> To: nutch-dev@lucene.apache.org
> Subject: Re: What's the status of Nutch-GUI?
>
> Hi Sami and Scott,
>
> This is on my TO-DO list as one of the items that I will begin working
> on getting into the sources as a committer. Additionally, I plan on
> integrating and testing the parse-xml plugin into the source tree. As
> soon as I get my Apache account and SVN access, I
File Protocol
I want to have Nutch crawl a filesystem, and if the content of the filesystem has changed since it was last crawled, the changed files should be fetched again. I studied the code for the adaptive re-fetch cycle, but the patch is out of date, as Nutch has since implemented other features. Also, I don't want to change anything in the core code, so that I can easily migrate to newer versions. I want to develop the feature as a plugin similar to the protocol-file plugin. I have been digging in the source code for the protocol-file plugin and therefore have a few questions.

My Nutch revision is 475201 from the Subversion server. In the class File.java (protocol-file plugin), the getProtocolOutput method has the following condition at line 62:

    } else if ((code >= 300 && code < 400) && code != 304) { // handle redirect
      if (redirects == MAX_REDIRECTS)
        throw new FileException("Too many redirects: " + url);
      u = new URL(response.getHeader("Location"));
      redirects++;
      if (LOG.isTraceEnabled()) {
        LOG.trace("redirect to " + u);
      }

In my case, if the file has not been modified, the code will be 304 (NOT MODIFIED). I want to know the effect of this branch on the crawldb. The file should not be removed or marked as GONE (CrawlDatum.STATUS_FETCH_GONE). If that is already the case, does that mean I don't have to write a plugin to handle the checking of unmodified content? If not, tell me how the protocol-file plugin checks for unmodified content, as it is said to mimic an http response.

Armel
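On the 304 question: whether the datum ends up GONE is decided by the fetcher and updatedb from the ProtocolStatus the plugin returns, and Nutch 0.8 carries a dedicated not-modified status for this case. A hedged sketch of the check such a plugin could make - the variable names are assumptions, and ProtocolStatus.STATUS_NOTMODIFIED is the constant I believe Nutch 0.8 defines:

    // Inside a protocol plugin's getProtocolOutput(url, datum), before reading the body:
    long onDisk = new java.io.File(path).lastModified();        // the file's mtime
    if (datum.getModifiedTime() > 0 && onDisk <= datum.getModifiedTime()) {
      // Tell the fetcher the content is unchanged instead of re-fetching it.
      return new ProtocolOutput(null, ProtocolStatus.STATUS_NOTMODIFIED);
    }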
RE: [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content
Andrzej,

The feature that I am after can be implemented by this patch, if I just adapt it right. I am not sure about this, but the patch seems a little too old to be applied to the latest release, Nutch 0.8.1. I want to implement a feature where the fetcher fetches files but only adds them if they have been modified after the latest fetch time. I want to implement that on a filesystem first, and update it later for network fetching. I would like to have a look at your full source code for the patch, in a zip file if possible. Once the feature is implemented, I will post it back here. I'd like to start working from your code first. You can either make the source code available here or mail it to me at armel dot nene @ idna-solutions dot com.

-----Original Message-----
From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED]
Sent: 12 November 2006 19:39
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting unmodified content

[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12449170 ]

Andrzej Bialecki commented on NUTCH-61:
---------------------------------------

Unfortunately, this patch hasn't been applied yet, due to its complexity and lack of testing. But it will be, sooner or later, because this functionality is required for any serious use. I'm planning to bring this patch up to the latest trunk, and then apply it piece-wise over the next couple of weeks.

> Adaptive re-fetch interval. Detecting unmodified content
>
> Key: NUTCH-61
> URL: http://issues.apache.org/jira/browse/NUTCH-61
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch
>
> Currently Nutch doesn't adjust its re-fetch period automatically, no matter
> whether individual pages change seldom or frequently. The goal of these
> changes is to extend the current codebase to support various possible
> adjustments to re-fetch times and intervals, and specifically a re-fetch
> schedule which tries to adapt the period between consecutive fetches to the
> period of content changes. Also, these patches implement checking whether
> the content has changed since the last fetch; protocol plugins are also
> changed to make use of this information, so that if content is unmodified
> it doesn't have to be fetched and processed.
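For readers who want the gist of the NUTCH-61 description above: the adaptive schedule widens the interval when a page comes back unmodified and shrinks it when it has changed. A hedged sketch of that core idea - the rate constants and clamps are assumptions, while the float fetch-interval (in days) matches the Nutch 0.8 CrawlDatum shown in the crawldb dumps earlier in this archive:

    import org.apache.nutch.crawl.CrawlDatum;

    public class AdaptiveInterval {
      private static final float INC_RATE = 1.5f;   // assumed growth factor
      private static final float DEC_RATE = 0.5f;   // assumed shrink factor
      private static final float MIN_DAYS = 1.0f;   // assumed clamps
      private static final float MAX_DAYS = 90.0f;

      public static void adjust(CrawlDatum datum, boolean modified) {
        float interval = datum.getFetchInterval();  // days in the 0.8 schema
        interval = modified ? interval * DEC_RATE   // changed: check again sooner
                            : interval * INC_RATE;  // unchanged: back off
        interval = Math.max(MIN_DAYS, Math.min(MAX_DAYS, interval));
        datum.setFetchInterval(interval);
      }
    }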