parse-rss test problem

2007-01-25 Thread kauu
I can't test my parse-rss plugin in Nutch 0.8.1.

 

I just can't test the default rsstest.rss file.

 

2007-01-25 17:04:34,703 INFO  conf.Configuration
(Configuration.java:getConfResourceAsInputStream(340)) - found resource
parse-plugins.xml at file:/E:/work/digibot_news/build_tt/parse-plugins.xml

2007-01-25 17:04:35,328 WARN  parse.rss (?:invoke0(?)) -
org.apache.commons.feedparser.FeedParserException:
java.lang.NoClassDefFoundError: org/jdom/Parent

2007-01-25 17:04:35,328 WARN  parse.rss (?:invoke0(?)) - at
org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:191)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:75)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
org.apache.nutch.parse.rss.RSSParser.getParse(RSSParser.java:92)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
org.apache.nutch.parse.ParseUtil.parseByExtensionId(ParseUtil.java:132)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
org.apache.nutch.parse.rss.TestRSSParser.testIt(TestRSSParser.java:91)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

2007-01-25 17:04:35,343 WARN  parse.rss (?:invoke0(?)) - at
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
java.lang.reflect.Method.invoke(Unknown Source)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
junit.framework.TestCase.runTest(TestCase.java:154)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
junit.framework.TestCase.runBare(TestCase.java:127)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
junit.framework.TestResult$1.protect(TestResult.java:106)

2007-01-25 17:04:35,359 WARN  parse.rss (?:invoke0(?)) - at
junit.framework.TestResult.runProtected(TestResult.java:124)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke0(?)) - at
junit.framework.TestResult.run(TestResult.java:109)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke0(?)) - at
junit.framework.TestCase.run(TestCase.java:118)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at
junit.framework.TestSuite.runTest(TestSuite.java:208)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at
junit.framework.TestSuite.run(TestSuite.java:203)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at
org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)

2007-01-25 17:04:35,375 WARN  parse.rss (?:invoke(?)) - at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)

2007-01-25 17:04:35,406 WARN  parse.rss (?:invoke(?)) - at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - Caused by:
java.lang.NoClassDefFoundError: org/jdom/Parent

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at
org.jaxen.jdom.JDOMXPath.init(JDOMXPath.java:100)

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at
org.apache.commons.feedparser.RSSFeedParser.parse(RSSFeedParser.java:65)

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - at
org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:185)

2007-01-25 17:04:35,421 WARN  parse.rss (?:invoke(?)) - ... 22 more

2007-01-25 17:04:35,421 WARN  parse.rss (RSSParser.java:getParse(100)) -
nutch:parse-rss:RSSParser Exception: java.lang.NoClassDefFoundError:
org/jdom/Parent

2007-01-25 17:04:35,437 WARN  parse.ParseUtil
(ParseUtil.java:parseByExtensionId(138)) - Unable to successfully parse
content file:/E:/work/digibot_news/rsstest.rss of type 
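The root cause in the log above is the java.lang.NoClassDefFoundError: org/jdom/Parent thrown from jaxen's JDOMXPath, which means the jdom jar is not on the classpath when the JUnit test runs. Things to check: that the jdom (and jaxen) jars sit in the plugin's lib directory, that they are declared in the plugin's plugin.xml, and that the Eclipse run configuration includes them. A sketch of the runtime section such a plugin.xml might contain (the jar names here are assumptions, not the actual parse-rss manifest):

```xml
<!-- Hypothetical sketch; the jar names are assumptions. -->
<runtime>
   <library name="parse-rss.jar">
      <export name="*"/>
   </library>
   <library name="commons-feedparser.jar"/>
   <library name="jdom.jar"/>
   <library name="jaxen-full.jar"/>
</runtime>
```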

 



Re: Fetcher2

2007-01-25 Thread kauu

Please give us the URL, thanks.

On 1/25/07, chee wu [EMAIL PROTECTED] wrote:


Just appended the portion for 0.8.1 to NUTCH-339.

- Original Message -
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2


 Chee,

 Can you make the code available through Jira.

 Thanks,

 Armel

 -
 Armel T. Nene
 iDNA Solutions
 Tel: +44 (207) 257 6124
 Mobile: +44 (788) 695 0483
 http://blog.idna-solutions.com

 -Original Message-
 From: chee wu [mailto:[EMAIL PROTECTED]
 Sent: 24 January 2007 03:59
 To: nutch-dev@lucene.apache.org
 Subject: Re: Fetcher2

 Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty easy.
 I can share the code if anyone wants to use it.
 - Original Message -
 From: Andrzej Bialecki [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 23, 2007 12:09 AM
 Subject: Re: Fetcher2


 chee wu wrote:
 Fetcher2 should be a great help for me, but it seems it can't be
 integrated with Nutch 0.8.1.
 Any advice on how to use it based on 0.8.1?


 You would have to port it to Nutch 0.8.1 - e.g. change all Text
 occurrences to UTF8, and most likely make other changes too ...

 --
 Best regards,
 Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com










--
www.babatu.com


RE: Fetcher2

2007-01-25 Thread Armel T. Nene
Kauu,

The URL for Fetcher2 is: https://issues.apache.org/jira/browse/NUTCH-339

Armel

-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com
-Original Message-
From: kauu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 09:31
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

please give us the url,thx




Modified date in crawldb

2007-01-25 Thread Armel T. Nene
Hi guys,

 

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
save the last modified date of files. I have run a crawl on my local file
system and on the web. When I dumped the content of the crawldb for both
crawls, the modified date of the files was set to 01-Jan-1970 01:00:00. I
don't know if it's intended to be that way or if it's a bug. Therefore my
question is:

 

* How does the generator know which files to crawl again?

o Is it looking at the fetch time?

o Or at the modified date, which can be misleading?

 

Most HTTP responses return a Last-Modified header, and files on a file
system all carry a last-modified date. How come it's not stored in the
crawldb?
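To the question above: in Nutch 0.8.x the generator schedules re-crawls from the fetch time, not the modified time. After a successful fetch, updatedb stores "time of fetch + retry interval" as the entry's next fetch time, and the generator selects entries whose stored fetch time has passed — which is why the dumps below show fetch times roughly 30 days in the future. A minimal sketch of that logic with simplified types (this is not the actual Nutch source):

```java
// Sketch (not the actual Nutch source) of how re-crawl scheduling can work
// using only the fetch time and retry interval stored in the crawldb.
public class FetchSchedule {
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    /** After a fetch, updatedb stores last fetch + interval as the next
     *  scheduled fetch time. */
    public static long nextFetchTime(long lastFetch, float retryIntervalDays) {
        return lastFetch + (long) (retryIntervalDays * DAY_MS);
    }

    /** The generator re-selects a URL once its scheduled time has passed. */
    public static boolean isDue(long storedFetchTime, long now) {
        return storedFetchTime <= now;
    }
}
```
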

 

Here is an extract from my 2 crawls:

 

http://dmoz.org/Arts/   Version: 4

Status: 2 (DB_fetched)

Fetch time: Thu Feb 22 12:45:43 GMT 2007

Modified time: Thu Jan 01 01:00:00 GMT 1970

Retries since fetch: 0

Retry interval: 30.0 days

Score: 0.013471641

Signature: fe52a0bcb1071070689d0f661c168648

Metadata: null

 

file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
Version: 4

Status: 2 (DB_fetched)

Fetch time: Sat Feb 24 10:31:44 GMT 2007

Modified time: Thu Jan 01 01:00:00 GMT 1970

Retries since fetch: 0

Retry interval: 30.0 days

Score: 1.1035091E-4

Signature: 57254d9ca2988ce1bf7f92b6239d6ebc

Metadata: null

 

Looking forward to your reply.

 

Regards,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



RE: Modified date in crawldb

2007-01-25 Thread Armel T. Nene
Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
version and was able to apply it fully, but I was not entirely successful in
running it with the XML parser plugin. If you have applied it successfully,
please let me know.

Regards,

Armel 
-
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com

-Original Message-
From: chee wu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It
seems the modified date will be used by NUTCH-61; you can find details at
the link below:
 http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue, and just wrote a simple function to
fulfill this:
1. Retrieve all the date information contained in the page content; a
regular expression is used to identify the date strings.
2. Choose the newest date found as the page's modified date.
3. Call the setModifiedTime() method of the CrawlDatum object in
FetcherThread.output().
Maybe you can use a parse filter to separate this function from the core
code.
I am also new to Nutch, so if anything is wrong, please feel free to point
it out.
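The steps described above can be sketched as follows. The class name and the single date pattern are hypothetical (a usable filter would need to recognize many more date formats):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the heuristic described above: scan the page text
// for date strings, keep the newest one, and use it as the modified time.
public class ModifiedTimeGuesser {

    // One illustrative pattern (yyyy-MM-dd); a real filter would need more.
    private static final Pattern DATE =
        Pattern.compile("\\b(\\d{4}-\\d{2}-\\d{2})\\b");

    /** Returns the newest date found in the content, or 0 (the 1970 epoch
     *  default seen in crawldb dumps) when no date is present. */
    public static long guessModifiedTime(String content) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        fmt.setLenient(false);
        long newest = 0L;
        Matcher m = DATE.matcher(content);
        while (m.find()) {
            try {
                newest = Math.max(newest, fmt.parse(m.group(1)).getTime());
            } catch (ParseException ignored) {
                // Skip strings that match the pattern but are not valid dates.
            }
        }
        return newest;
    }
}
```

The returned value would then be passed to setModifiedTime() on the CrawlDatum in FetcherThread.output(), as in step 3 above.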


- Original Message - 
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb






Thread-safe methods in Nutch

2007-01-25 Thread Armel T. Nene
Hi guys,

 

I know it's me again. I have been testing Nutch robustly lately, and here
are some threading issues that I found.

I am running version 0.8.2-dev. When Nutch is initially run (either from a
script or Ant), it defaults to 10 threads for the fetcher. This is actually
good for performance, as a large number of URLs can be indexed fast enough.
The problem is that some plugins are not thread-safe (or is it the fetcher
that's not thread-safe?).

 

I am running the parse-xml plugin (NUTCH-185) and have hit some issues:

 

When running multiple threads, such as the default 10, I see some
inconsistency in the stored fields and values. I found that the first 6
documents would be indexed without problems, then 4 with errors, then 4
correct, then x more with errors, and so forth. At first I couldn't see
where the problem was; after several debugging sessions, I realized that it
could be a threading issue. I ran Nutch with the minimum of 1 thread and the
fields were stored without any issues.
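For what it's worth, a classic cause of this every-Nth-document corruption under concurrency is a plugin keeping a non-thread-safe helper, such as SimpleDateFormat, in a field shared by all fetcher threads. This is only a guess at the parse-xml problem, but the broken and safe patterns look like this:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustration of a common source of such bugs, not necessarily the one in
// parse-xml: SimpleDateFormat keeps internal state, so a single instance
// shared by ten fetcher threads produces intermittently corrupted fields.
public class SafeFormatter {

    // Broken pattern (one formatter shared across all threads):
    // static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy-MM-dd");

    // Safe pattern: one formatter instance per thread.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        new ThreadLocal<SimpleDateFormat>() {
            @Override protected SimpleDateFormat initialValue() {
                return new SimpleDateFormat("yyyy-MM-dd");
            }
        };

    public static String format(Date d) {
        return FMT.get().format(d);
    }

    public static Date parse(String s) throws ParseException {
        return FMT.get().parse(s);
    }
}
```
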

 

I don't know how to conclude this, but I think that the methods Nutch uses
for threading are not thread-safe. I could be wrong, therefore I am awaiting
any reply.

 

Regards,

 

Armel

 

-

Armel T. Nene

iDNA Solutions

Tel: +44 (207) 257 6124

Mobile: +44 (788) 695 0483 

http://blog.idna-solutions.com

 



[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-25 Thread Brian Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467471
 ] 

Brian Whitman commented on NUTCH-433:
-

This is still not fixed in the latest nightly  -- 
http://people.apache.org/builds/lucene/nutch/nightly/nutch-2007-01-25.tar.gz -- 
same error. Also tried the svn trunk, no change.

I imagine it's because it's a hadoop issue and not a nutch one, but the nutch 
nightly package should include the latest hadoop as well. 




 java.io.EOFException in newer nightlies in mergesegs or indexing from 
 hadoop.io.DataOutputBuffer
 

 Key: NUTCH-433
 URL: https://issues.apache.org/jira/browse/NUTCH-433
 Project: Nutch
  Issue Type: Bug
  Components: generator, indexer
Affects Versions: 0.9.0
 Environment: Both Linux/i686 and Mac OS X PPC/Intel, but platform 
 independent
Reporter: Brian Whitman
 Assigned To: Sami Siren
Priority: Critical
 Fix For: 0.9.0


 The nightly builds have not been working at all for the past couple of weeks. 
 Sami Siren has narrowed it down to HADOOP-331.
 To replicate: download the nightly, then:
 bin/nutch inject crawl/crawldb urls/  # a single URL is in urls/urls -- 
 http://apache.org
 bin/nutch generate crawl/crawldb crawl/segments
 bin/nutch fetch crawl/segments/2007...
 bin/nutch updatedb crawl/crawldb crawl/segments/2007...
 # generate a new segment with 5 URIs
 bin/nutch generate crawl/crawldb crawl/segments -topN 5
 bin/nutch fetch crawl/segments/2007... # new segment
 bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment
 # merge the segments and index
 bin/nutch mergesegs crawl/merged -dir crawl/segments
 ..
 We get a crash in the mergesegs. This crash, with the exact same script and 
 start URI, configuration and plugins, does not happen on a nightly from early 
 January.
 2007-01-18 14:57:11,411 INFO  segment.SegmentMerger - Merging 2 segments to 
 crawl/merged_07_01_18_14_56_22/20070118145711
 2007-01-18 14:57:11,482 INFO  segment.SegmentMerger - SegmentMerger:   adding 
 crawl/segments/20070118145628
 2007-01-18 14:57:11,489 INFO  segment.SegmentMerger - SegmentMerger:   adding 
 crawl/segments/20070118145641
 2007-01-18 14:57:11,495 INFO  segment.SegmentMerger - SegmentMerger: using 
 segment data from: content crawl_generate crawl_fetch crawl_parse parse_data 
 parse_text
 2007-01-18 14:57:11,594 INFO  mapred.InputFormatBase - Total input paths to 
 process : 12
 2007-01-18 14:57:11,819 INFO  mapred.JobClient - Running job: job_5ug2ip
 2007-01-18 14:57:12,073 WARN  mapred.LocalJobRunner - job_5ug2ip
 java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:178)
 at 
 org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
 at 
 org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
 at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
 at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
 at 
 org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
 at 
 org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
 at 
 org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:100)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
 at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467478
 ] 

Andrzej Bialecki  commented on NUTCH-433:
-

Nutch and Hadoop are separate projects, with the latter evolving at
breakneck speed. It would require significant effort to keep each Nutch
nightly build synchronized with each nightly build of Hadoop.




[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-25 Thread Brian Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467486
 ] 

Brian Whitman commented on NUTCH-433:
-

OK, understood. But the nutch nightly should at least include a version of
hadoop that works with the corresponding nutch code.

Should I reopen this bug? The underlying problem may have been fixed but it 
still doesn't work in the automated builds or building from svn.

Is there perhaps a way to have a test suite (run a sample short crawl maybe?) 
in the nightly build process?







Re: i18n in nutch home page is a misnomer

2007-01-25 Thread Doug Cutting

Teruhiko Kurosaka wrote:

I suggest i18n be renamed to l10n, short for
localization.


Can you please file an issue in Jira for this?  Ideally you could even 
provide a patch.  The source for the website is in subversion at:


http://svn.apache.org/repos/asf/lucene/nutch/trunk/src/site

Forrest is used to generate the site from this.

http://forrest.apache.org/

Doug


Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting

Scott Ganyo (JIRA) wrote:

 ... since Hadoop hijacks and reassigns all log formatters (also a bad 
practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...


FYI, Hadoop no longer does this.

Doug


[jira] Commented: (NUTCH-433) java.io.EOFException in newer nightlies in mergesegs or indexing from hadoop.io.DataOutputBuffer

2007-01-25 Thread Sami Siren (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12467491
 ] 

Sami Siren commented on NUTCH-433:
--

OK, now it is committed, sorry.




Re: Modified date in crawldb

2007-01-25 Thread Andrzej Bialecki

Armel T. Nene wrote:

There is a modified date returned in most http headers and files on file
system all have modified date which is the last modified date. How come it's
not stored in the crawldb?

  


This is the issue described in NUTCH-61 - patches from that issue will 
be applied soon to trunk/ .


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Next Nutch release

2007-01-25 Thread Doug Cutting

Dennis Kubes wrote:

Andrzej Bialecki wrote:
I believe that at this point it's crucial to keep the project 
well-focused (at the moment I think the main focus is on larger 
installations, and not the small ones), and also to make Nutch 
attractive to developers as a reusable search engine component.


I think there are two areas.  One is to keep the focus as you stated 
above.  The other is to provide a path to get more people involved.  If 
no one objects I will continue working on such a path.


Please let me know if I can help in this people area.  I'm currently 
unable to assist with technical Nutch issues on a day-to-day basis, but 
I am still very interested in doing what I can to ensure Nutch's 
long-term vitality as a project.


Cheers,

Doug


Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Chris Mattmann
Hi Doug,

 So, does this render the patch that I wrote obsolete?

Cheers,
  Chris



On 1/25/07 10:08 AM, Doug Cutting [EMAIL PROTECTED] wrote:

 Scott Ganyo (JIRA) wrote:
  ... since Hadoop hijacks and reassigns all log formatters (also a bad
 practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...
 
 FYI, Hadoop no longer does this.
 
 Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Doug Cutting

Chris Mattmann wrote:

 So, does this render the patch that I wrote obsolete?


It's at least out-of-date and perhaps obsolete.  A quick read of 
Fetcher.java suggests there might be a case where a fatal error is 
logged but the fetcher doesn't exit, in FetcherThread#output().


Doug


Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Chris Mattmann
 It's at least out-of-date and perhaps obsolete.  A quick read of
 Fetcher.java suggests there might be a case where a fatal error is
 logged but the fetcher doesn't exit, in FetcherThread#output().
 

So this raises an interesting question:

People (such as Scott G.) out there -- are you folks still experiencing
similar problems? Do the recent Hadoop changes alleviate the bad behavior
you were experiencing? If so, then maybe this issue should be closed...

Cheers,
  Chris





Re: Modified date in crawldb

2007-01-25 Thread chee wu
Armel,
   Sorry, I haven't tried this patch yet.

- Original Message - 
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 11:07 PM
Subject: RE: Modified date in crawldb


 Chee,
 
 Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
 version and was able to apply the patch fully, but I was not entirely
 successful in running it with the XML parser plugin. If you have applied it
 successfully, let me know.
 
 Regards,
 
 Armel 
 -
 Armel T. Nene
 iDNA Solutions
 Tel: +44 (207) 257 6124
 Mobile: +44 (788) 695 0483 
 http://blog.idna-solutions.com
 
 -Original Message-
 From: chee wu [mailto:[EMAIL PROTECTED] 
 Sent: 25 January 2007 13:44
 To: nutch-dev@lucene.apache.org
 Subject: Re: Modified date in crawldb
 
 I also had this question a few days ago, and I am using Nutch 0.8.1. It seems
 the modified date will be used by NUTCH-61; you can find details at the
 link below:
 http://issues.apache.org/jira/browse/NUTCH-61
 
 I haven't studied this JIRA, and just wrote a simple function to accomplish
 this:
 1. Retrieve all the date information contained in the page content; a regular
 expression is used to identify the date strings.
 2. Choose the newest date found as the page's modified date.
 3. Call setModifiedTime() on the CrawlDatum object in
 FetcherThread.output().
 Maybe you can use a parse filter to separate this function from the core
 code.
 I am also new to Nutch, so if anything is wrong, please feel free to point it out.
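The first two steps of that approach can be sketched like this (a hypothetical helper, not Nutch or NUTCH-61 code; the class name, the single "25 Jan 2007"-style pattern, and the formats handled are all illustrative assumptions):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: scan page text for date strings with a regular
 *  expression and return the newest date found, to be used as the page's
 *  modified date. A real filter would need more patterns than this one. */
public class NewestDateExtractor {

    // One illustrative pattern: dates like "25 Jan 2007" or "25 January 2007".
    private static final Pattern DAY_MONTH_YEAR = Pattern.compile(
        "\\b(\\d{1,2})\\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\\s+(\\d{4})\\b");

    public static Date newestDate(String text) {
        SimpleDateFormat fmt = new SimpleDateFormat("d MMM yyyy", Locale.ENGLISH);
        Date newest = null;
        Matcher m = DAY_MONTH_YEAR.matcher(text);
        while (m.find()) {
            try {
                // group(2) is always the three-letter abbreviation.
                Date d = fmt.parse(m.group(1) + " " + m.group(2) + " " + m.group(3));
                if (newest == null || d.after(newest)) {
                    newest = d;  // keep the most recent date seen so far
                }
            } catch (ParseException e) {
                // skip strings that matched the pattern but fail to parse
            }
        }
        return newest;
    }

    /** Convenience wrapper for demonstration: newest date as "d MMM yyyy", or null. */
    public static String formatNewest(String text) {
        Date d = newestDate(text);
        return d == null ? null
            : new SimpleDateFormat("d MMM yyyy", Locale.ENGLISH).format(d);
    }
}
```

The result would then be written back via CrawlDatum's setter in the fetcher output step, as in point 3 above.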
 
 
 - Original Message - 
 From: Armel T. Nene [EMAIL PROTECTED]
 To: nutch-dev@lucene.apache.org
 Sent: Thursday, January 25, 2007 7:52 PM
 Subject: Modified date in crawldb
 
 
 Hi guys,
 
 
 
 I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not
 actually save the last modified date of files. I have run a crawl on my
 local file system and the web. When I dumped the contents of the crawldb
 for both crawls, the modified date of the files was set to
 01-Jan-1970 01:00:00. I don't know if this is intended or if it's a bug.
 Therefore my question is:
 
 
 
 * How does the generator know which file to crawl again?
 
 o Is it looking at the fetch time?
 
 o The modified date, as this can be misleading?
 
 
 
 Most HTTP responses include a Last-Modified header, and files on a file
 system all carry a last-modified date. How come it's not stored in the
 crawldb?
 
 
 
 Here is an extract from my 2 crawls:
 
 
 
 http://dmoz.org/Arts/   Version: 4
 
 Status: 2 (DB_fetched)
 
 Fetch time: Thu Feb 22 12:45:43 GMT 2007
 
 Modified time: Thu Jan 01 01:00:00 GMT 1970
 
 Retries since fetch: 0
 
 Retry interval: 30.0 days
 
 Score: 0.013471641
 
 Signature: fe52a0bcb1071070689d0f661c168648
 
 Metadata: null
 
 
 
 file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_0121.xml
 Version: 4
 
 Status: 2 (DB_fetched)
 
 Fetch time: Sat Feb 24 10:31:44 GMT 2007
 
 Modified time: Thu Jan 01 01:00:00 GMT 1970
 
 Retries since fetch: 0
 
 Retry interval: 30.0 days
 
 Score: 1.1035091E-4
 
 Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
 
 Metadata: null
 
 
 
 Looking forward to your reply.
 
 
 
 Regards,
 
 
 
 Armel
 
 
 
 -
 
 Armel T. Nene
 
 iDNA Solutions
 
 Tel: +44 (207) 257 6124
 
 Mobile: +44 (788) 695 0483 
 
 http://blog.idna-solutions.com/ http://blog.idna-solutions.com
 
 
 

 


parse-rss make them items as different pages

2007-01-25 Thread kauu

I want to crawl RSS feeds, parse them, and then index them so that, when I
search the content, each hit behaves like an individual page.


I don't know whether I have explained this clearly.

<item>
   <title>Late snowstorm strikes Europe, causing flight delays and traffic chaos (photos)</title>
   <description>A snowstorm swept across Europe, causing repeated flight delays.
On January 24, several airliners waited at Stuttgart airport in Germany to have snow and ice removed from their fuselages. The same day, workers cleared snow from a runway at Munich airport in southern Germany.
Reports say the late-arriving snowstorm swept across ... for two consecutive days
   </description>
   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
   <category>Sohu Focus Photo News</category>
   <author>[EMAIL PROTECTED]</author>
   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an RSS file.

I want Nutch to treat an item as an individual page,

so that when I search for something within the item, Nutch returns it as a hit.

So...
Can anyone tell me how to go about this?
Any reply will be appreciated.
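One way to get per-item hits is to split the feed into one document per <item> before indexing, keyed by each item's <link> URL. A minimal, hypothetical sketch using the JDK's DOM parser (the class and field names are illustrative, not part of Nutch or the parse-rss plugin):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: split an RSS feed into one record per <item>,
 *  keyed by the item's <link>, so each item can be indexed as its own page. */
public class RssItemSplitter {

    public static class Page {
        public final String url;
        public final String title;
        public final String text;
        public Page(String url, String title, String text) {
            this.url = url; this.title = title; this.text = text;
        }
    }

    public static List<Page> split(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            List<Page> pages = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // One record per item: link becomes the page URL,
                // description becomes the indexable text.
                pages.add(new Page(childText(item, "link"),
                                   childText(item, "title"),
                                   childText(item, "description")));
            }
            return pages;
        } catch (Exception e) {
            throw new RuntimeException("feed parse failed", e);
        }
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent().trim() : "";
    }
}
```

In a Nutch setting, each extracted link could then be fed back as its own URL to fetch and index, rather than indexing the whole feed as a single document.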

--
www.babatu.com