parse-rss test problem
I can't test my parse-rss plugin in Nutch 0.8.1; even the default rsstest.rss file fails to parse:

2007-01-25 17:04:34,703 INFO conf.Configuration (Configuration.java:getConfResourceAsInputStream(340)) - found resource parse-plugins.xml at file:/E:/work/digibot_news/build_tt/parse-plugins.xml
2007-01-25 17:04:35,328 WARN parse.rss - org.apache.commons.feedparser.FeedParserException: java.lang.NoClassDefFoundError: org/jdom/Parent
    at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:191)
    at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:75)
    at org.apache.nutch.parse.rss.RSSParser.getParse(RSSParser.java:92)
    at org.apache.nutch.parse.ParseUtil.parseByExtensionId(ParseUtil.java:132)
    at org.apache.nutch.parse.rss.TestRSSParser.testIt(TestRSSParser.java:91)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: java.lang.NoClassDefFoundError: org/jdom/Parent
    at org.jaxen.jdom.JDOMXPath.init(JDOMXPath.java:100)
    at org.apache.commons.feedparser.RSSFeedParser.parse(RSSFeedParser.java:65)
    at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:185)
    ... 22 more
2007-01-25 17:04:35,421 WARN parse.rss (RSSParser.java:getParse(100)) - nutch:parse-rss:RSSParser Exception: java.lang.NoClassDefFoundError: org/jdom/Parent
2007-01-25 17:04:35,437 WARN parse.ParseUtil (ParseUtil.java:parseByExtensionId(138)) - Unable to successfully parse content file:/E:/work/digibot_news/rsstest.rss of type
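For what it's worth, the root cause in that trace is not the feed at all: java.lang.NoClassDefFoundError: org/jdom/Parent means jaxen's JDOM support cannot resolve the org.jdom.Parent interface, which (as far as I know) was only introduced in JDOM 1.0, so the plugin's classpath likely carries no JDOM jar, or a pre-1.0 one. A quick way to confirm is to check whether the class resolves at runtime; a minimal sketch (the helper class name is made up):

```java
// Sketch: verify at runtime whether a class that a dependency needs is
// actually on the classpath. The class name below comes straight from
// the stack trace; the helper itself is illustrative, not Nutch code.
public class ClasspathCheck {
    public static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // If this prints false when run with the plugin's classpath,
        // the jdom jar is missing or too old (pre-1.0 JDOM lacks Parent).
        System.out.println("org.jdom.Parent present: "
                + isPresent("org.jdom.Parent"));
    }
}
```

If it prints false, adding a JDOM 1.0 jar to the parse-rss plugin's lib directory (and its plugin descriptor) would be the first thing to try.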
Re: Fetcher2
please give us the url, thx

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:

Just appended the portion for .81 to NUTCH-339

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2

Chee,

Can you make the code available through Jira.

Thanks,
Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: 24 January 2007 03:59
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it's pretty easy. I can share the code if anyone wants to use it.

----- Original Message -----
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2

chee wu wrote:

Fetcher2 should be a great help for me, but it seems it can't integrate with Nutch 0.8.1. Any advice on how to use it based on .81?

You would have to port it to Nutch 0.8.1 - e.g. change all Text occurrences to UTF8, and most likely make other changes too ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com

--
www.babatu.com
RE: Fetcher2
Kauu,

The url for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339

Armel

(quoted thread, repeated verbatim from the previous message, trimmed)
Modified date in crawldb
Hi guys,

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually save the last modified date of files. I have run a crawl on my local file system and on the web. When I dumped the content of the crawldb for both crawls, the modified date of the files was set to 01-Jan-1970 01:00:00. I don't know if it's intended to be this way or if it's a bug. Therefore my question is:

* How does the generator know which file to crawl again?
  o Is it looking at the fetch time?
  o The modified date, as this can be misleading?

There is a modified date returned in most http headers, and files on a file system all have a last modified date. How come it's not stored in the crawldb? Here is an extract from my 2 crawls:

http://dmoz.org/Arts/
Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:45:43 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.013471641
Signature: fe52a0bcb1071070689d0f661c168648
Metadata: null

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
Version: 4
Status: 2 (DB_fetched)
Fetch time: Sat Feb 24 10:31:44 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.1035091E-4
Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
Metadata: null

Looking forward to your reply.

Regards,
Armel
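One note on the 01-Jan-1970 value itself: an unset modified time is stored as the long 0, i.e. the Unix epoch, not a real date. A JVM in the Europe/London zone renders epoch zero with a +01:00 offset (the UK ran year-round summer time in 1970), which matches the 01:00:00 in the dump. A small sketch illustrating this (it assumes the poster's machine uses the UK timezone; the class name is made up):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// An unset modified time is the long 0, i.e. the Unix epoch. Rendered in
// the Europe/London zone that instant is 01:00 on Jan 1 1970, matching
// the "Modified time" lines in the crawldb dump.
public class EpochDemo {
    public static String render(long millis) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("Europe/London"));
        return fmt.format(new Date(millis));
    }

    public static void main(String[] args) {
        System.out.println(render(0L));
    }
}
```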
RE: Modified date in crawldb
Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that version and was able to apply it fully, but was not entirely successful in running it with the XML parser plugin. If you have applied it successfully, let me know.

Regards,
Armel

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It seems the modified date will be used by NUTCH-61; you can find details at the link below:

http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue, and just wrote a simple function to fill the gap:

1. Retrieve all the date information contained in the page content; a regular expression is used to identify the date strings.
2. Choose the newest date found as the page's modified date.
3. Call setModifiedTime() on the CrawlDatum object in FetcherThread.output().

Maybe you can use a parse filter to separate this function from the core code. I am also new to Nutch, so if anything is wrong, please feel free to point it out.
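The three-step approach chee wu describes above could be sketched roughly like this (a hypothetical helper, not actual Nutch code; a real parse filter would try several date formats, not just one):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the approach described above: regex-match date strings in the
// page content and keep the newest one as the assumed modified time.
public class ContentDateGuesser {
    // One illustrative pattern (yyyy-MM-dd); a real filter would try
    // several formats (RFC 822 dates, "25 Jan 2007", etc.).
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    public static Date newestDate(String content) {
        // SimpleDateFormat is not thread-safe; fine for this
        // single-threaded sketch since it stays method-local.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        Date newest = null;
        Matcher m = DATE.matcher(content);
        while (m.find()) {
            try {
                Date d = fmt.parse(m.group());
                if (newest == null || d.after(newest)) {
                    newest = d;
                }
            } catch (ParseException ignored) {
                // skip strings that merely look like dates
            }
        }
        // Step 3 (not shown): the caller would pass newest.getTime()
        // to CrawlDatum.setModifiedTime() in FetcherThread.output().
        return newest;
    }

    public static void main(String[] args) {
        System.out.println(newestDate("Posted 2007-01-20, updated 2007-01-25."));
    }
}
```

As chee wu notes, heuristically guessing dates from page text can misfire; treating it as a parse filter keeps the guesswork out of the fetcher core.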
thread-safe methods in Nutch
Hi guys,

I know it's me again. I have been testing Nutch extensively lately, and here are some threading issues I found. I am running version 0.8.2-dev.

When Nutch is initially run (either from the script or from ANT), it defaults to 10 threads for the fetcher. This is good for performance, as a large number of urls can be indexed quickly. The problem is that some plugins are not thread-safe (or perhaps it is the fetcher that is not thread-safe). I am running the parse-xml plugin (NUTCH-185) and see some issues: when running multiple threads, such as the default 10, I get inconsistencies in the stored fields and values. The first 6 documents are indexed without problems, then 4 with errors, 4 correct, some number with errors, and so forth. At first I couldn't see where the problem was; after several debugging sessions I realized it could be a threading issue. I ran Nutch with the minimum of 1 thread and the fields were stored without any issues.

I don't know how to conclude this, but I think the methods Nutch uses across threads are not thread-safe. I could be wrong, so I await any reply.

Regards,
Armel
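A common source of exactly these symptoms (correct output single-threaded, corrupted fields under 10 threads) is a plugin keeping a non-thread-safe object, SimpleDateFormat being the classic example, in a field shared by all fetcher threads. One usual fix, sketched here with hypothetical names rather than actual Nutch plugin code, is to give each thread its own instance via ThreadLocal:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch: SimpleDateFormat breaks when many threads share one instance.
// Wrapping it in ThreadLocal gives each fetcher thread its own copy
// without creating a new formatter on every call.
public class ThreadSafeFormatter {
    private static final ThreadLocal<SimpleDateFormat> FMT =
        new ThreadLocal<SimpleDateFormat>() {
            @Override protected SimpleDateFormat initialValue() {
                return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            }
        };

    public static String format(Date d) {
        return FMT.get().format(d); // safe: per-thread instance
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() {
                for (int i = 0; i < 1000; i++) format(new Date(0L));
            }
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start(); a.join(); b.join();
        System.out.println(format(new Date(0L)));
    }
}
```

The alternatives are synchronizing around the shared object (which serializes the fetcher threads) or constructing a fresh instance per call; ThreadLocal avoids both costs.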
[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467471 ]

Brian Whitman commented on NUTCH-433:
-------------------------------------

This is still not fixed in the latest nightly -- http://people.apache.org/builds/lucene/nutch/nightly/nutch-2007-01-25.tar.gz -- same error. Also tried the svn trunk, no change. I imagine it's because it's a hadoop issue and not a nutch one, but the nutch nightly package should include the latest hadoop as well.

java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

Key: NUTCH-433
URL: https://issues.apache.org/jira/browse/NUTCH-433
Project: Nutch
Issue Type: Bug
Components: generator, indexer
Affects Versions: 0.9.0
Environment: Both Linux/i686 and Mac OS X PPC/Intel, but platform independent
Reporter: Brian Whitman
Assigned To: Sami Siren
Priority: Critical
Fix For: 0.9.0

The nightly builds have not been working at all for the past couple of weeks. Sami Siren has narrowed it down to HADOOP-331. To replicate: download the nightly, then:

bin/nutch inject crawl/crawldb urls/  # a single URL is in urls/urls -- http://apache.org
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/2007...
bin/nutch updatedb crawl/crawldb crawl/segments/2007...
# generate a new segment with 5 URIs
bin/nutch generate crawl/crawldb crawl/segments -topN 5
bin/nutch fetch crawl/segments/2007...  # new segment
bin/nutch updatedb crawl/crawldb crawl/segments/2007...  # new segment
# merge the segments and index
bin/nutch mergesegs crawl/merged -dir crawl/segments

We get a crash in the mergesegs. This crash, with the exact same script and start URI, configuration and plugins, does not happen on a nightly from early January.
2007-01-18 14:57:11,411 INFO segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
2007-01-18 14:57:11,482 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145628
2007-01-18 14:57:11,489 INFO segment.SegmentMerger - SegmentMerger: adding crawl/segments/20070118145641
2007-01-18 14:57:11,495 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2007-01-18 14:57:11,594 INFO mapred.InputFormatBase - Total input paths to process : 12
2007-01-18 14:57:11,819 INFO mapred.JobClient - Running job: job_5ug2ip
2007-01-18 14:57:12,073 WARN mapred.LocalJobRunner - job_5ug2ip
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
    at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
    at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
    at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
    at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
    at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:100)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467478 ]

Andrzej Bialecki commented on NUTCH-433:
----------------------------------------

Nutch and Hadoop are separate projects, with the latter evolving at breakneck speed. It would require significant effort to keep each Nutch nightly build synchronized with each nightly build of Hadoop.
[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467486 ]

Brian Whitman commented on NUTCH-433:
-------------------------------------

OK, understood. But the nutch nightly should at least include a version of hadoop that works with the corresponding nutch code. Should I reopen this bug? The underlying problem may have been fixed, but it still doesn't work in the automated builds or when building from svn. Is there perhaps a way to have a test suite (run a sample short crawl, maybe?) in the nightly build process?
Re: i18n in nutch home page is a misnomer
Teruhiko Kurosaka wrote:

I suggest i18n be renamed to l10n, short for localization.

Can you please file an issue in Jira for this? Ideally you could even provide a patch. The source for the website is in subversion at:

http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site

Forrest (http://forrest.apache.org/) is used to generate the site from this.

Doug
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Scott Ganyo (JIRA) wrote:

... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...

FYI, Hadoop no longer does this.

Doug
[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer
[ https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467491 ]

Sami Siren commented on NUTCH-433:
----------------------------------

ok, now it is committed, sorry.
Re: Modified date in crawldb
Armel T. Nene wrote:

Hi guys, I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually save the last modified date of files. ... There is a modified date returned in most http headers, and files on a file system all have a last modified date. How come it's not stored in the crawldb?

This is the issue described in NUTCH-61 - patches from that issue will be applied soon to trunk/ .

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: Next Nutch release
Dennis Kubes wrote:

Andrzej Bialecki wrote:

I believe that at this point it's crucial to keep the project well-focused (at the moment I think the main focus is on larger installations, and not the small ones), and also to make Nutch attractive to developers as a reusable search engine component.

I think there are two areas. One is to keep the focus as you stated above. The other is to provide a path to get more people involved. If no one objects I will continue working on such a path.

Please let me know if I can help in this people area. I'm currently unable to assist with technical Nutch issues on a day-to-day basis, but I am still very interested in doing what I can to ensure Nutch's long-term vitality as a project.

Cheers,
Doug
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Hi Doug,

So, does this render the patch that I wrote obsolete?

Cheers,
Chris

On 1/25/07 10:08 AM, Doug Cutting [EMAIL PROTECTED] wrote:

Scott Ganyo (JIRA) wrote:

... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...

FYI, Hadoop no longer does this.

Doug

______________________________________________
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B  Mailstop: 171-246
Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Chris Mattmann wrote:

So, does this render the patch that I wrote obsolete?

It's at least out-of-date and perhaps obsolete. A quick read of Fetcher.java suggests there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output().

Doug
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
It's at least out-of-date and perhaps obsolete. A quick read of Fetcher.java suggests there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output().

So this raises an interesting question: people out there (such as Scott G.) -- are you folks still experiencing similar problems? Do the recent Hadoop changes alleviate the bad behavior you were experiencing? If so, then maybe this issue should be closed...

Cheers,
Chris
Re: Modified date in crawldb
Armel,

Sorry, I haven't tried this patch yet.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 11:07 PM
Subject: RE: Modified date in crawldb

Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that version and was able to apply the patch fully, but I was not entirely successful in running it with the XML parser plugin. If you have applied it successfully, let me know.

Regards,
Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]]
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It seems the modified date will be used by NUTCH-61; you can find details at the link below:

http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue, and just wrote a simple function to fill the gap:

1. Retrieve all the date information contained in the page content; a regular expression is used to identify the date strings.
2. Choose the newest date found as the page's modified date.
3. Call setModifiedTime() on the CrawlDatum object in FetcherThread.output().

Maybe you can use a parse filter to separate this function from the core code. I am also new to Nutch, so if anything is wrong, please feel free to point it out.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb

Hi guys,

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually save the last modified date of files. I have run a crawl on my local file system and on the web. When I dumped the contents of the crawldb for both crawls, the modified date of the files was set to 01-Jan-1970 01:00:00. I don't know if it's intended to be that way or if it's a bug.
Therefore my questions are:

* How does the generator know which file to crawl again?
  - Is it looking at the fetch time?
  - Or the modified date, which can be misleading?

There is a modified date returned in most HTTP headers, and files on a file system all carry a last-modified date. How come it's not stored in the crawldb? Here is an extract from my two crawls:

http://dmoz.org/Arts/
Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:45:43 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.013471641
Signature: fe52a0bcb1071070689d0f661c168648
Metadata: null

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
Version: 4
Status: 2 (DB_fetched)
Fetch time: Sat Feb 24 10:31:44 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.1035091E-4
Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
Metadata: null

Looking forward to your reply.

Regards,
Armel
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
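Chee wu's three-step regex approach from earlier in this thread (extract date strings from the page content, keep the newest one, pass it to setModifiedTime() in FetcherThread.output()) could be sketched roughly as below. This is a hypothetical helper, not part of Nutch; the class name, the single date format handled, and the return convention are all assumptions for illustration.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch of the regex-based approach described above:
 * scan page content for date strings and treat the newest one found
 * as the page's modified time.
 */
public class ModifiedDateGuesser {

  // Matches dates like "25 Jan 2007" -- only one of the many formats
  // a real implementation would need to recognize.
  private static final Pattern DATE = Pattern.compile(
      "\\b(\\d{1,2} (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \\d{4})\\b");

  /** Returns the newest date found as epoch millis, or 0 if none found. */
  public static long guessModifiedTime(String content) {
    SimpleDateFormat fmt = new SimpleDateFormat("d MMM yyyy", Locale.ENGLISH);
    long newest = 0L;
    Matcher m = DATE.matcher(content);
    while (m.find()) {
      try {
        Date d = fmt.parse(m.group(1));
        newest = Math.max(newest, d.getTime());
      } catch (ParseException ignored) {
        // skip strings that only look like dates
      }
    }
    return newest;
  }
}
```

The result would then be applied to the CrawlDatum, e.g. `datum.setModifiedTime(ModifiedDateGuesser.guessModifiedTime(text))`; as chee wu notes, a parse filter would keep this logic out of the core fetcher code. Note the caveat from the thread still applies: a date mentioned in the page body is not necessarily the document's real last-modified date.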
parse-rss: make RSS items appear as different pages
I want to crawl RSS feeds, parse them, then index them, and finally, when searching the content, I want each hit to be like an individual page. I don't know whether I have explained this clearly. Here is one item from an RSS file (English translations of the Chinese text are given in brackets):

<item>
  <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图) [Late-arriving snowstorms strike Europe, causing flight delays and traffic chaos (photos)]</title>
  <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中... [Snowstorms swept across Europe, causing repeated flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow cleared from their fuselages, and workers swept snow from the runways at Munich airport in southern Germany. According to reports, the late-arriving snowstorms swept for two consecutive days across ...]</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>搜狐焦点图新闻 [Sohu focus photo news]</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an RSS file. I want Nutch to treat an item like an individual page, so that when I search for something in this item, Nutch returns it as a hit. Can anyone tell me how to do this? Any reply will be appreciated.
--
www.babatu.com
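One way to approach this (not built into the stock parse-rss plugin) is to split the fetched feed into its items before indexing, so each item's link becomes its own indexable "page". A minimal sketch of the splitting step, using only the JDK's DOM parser; the class and field names are assumptions for illustration, and a real plugin would hook this into Nutch's Parse/Outlink machinery instead of returning a plain list:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Hypothetical sketch: split an RSS feed into one record per <item>,
 * so each item can be indexed separately, keyed by its link.
 */
public class RssItemSplitter {

  public static class Item {
    public final String title, link, description;
    Item(String title, String link, String description) {
      this.title = title;
      this.link = link;
      this.description = description;
    }
  }

  public static List<Item> split(String rssXml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
      List<Item> items = new ArrayList<>();
      NodeList nodes = doc.getElementsByTagName("item");
      for (int i = 0; i < nodes.getLength(); i++) {
        Element e = (Element) nodes.item(i);
        items.add(new Item(text(e, "title"), text(e, "link"), text(e, "description")));
      }
      return items;
    } catch (Exception e) {
      throw new RuntimeException("failed to parse feed", e);
    }
  }

  // First matching child element's text, or "" if the tag is absent.
  private static String text(Element parent, String tag) {
    NodeList n = parent.getElementsByTagName(tag);
    return n.getLength() > 0 ? n.item(0).getTextContent() : "";
  }
}
```

Each resulting Item could then be indexed as its own document with the item's link as the URL, which is what makes a search hit land on the individual item rather than on the feed as a whole.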