Re: log4j problem

2007-01-31 Thread chee wu
Setting the two Java arguments -Dhadoop.log.file and -Dhadoop.log.dir should fix your problem.
BTW, please don't put so many Chinese characters in your mail.
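For context, a small sketch of why those two arguments matter (illustrative only; the property names come from the advice above, everything else is invented and is not the actual log4j/Nutch source): the appender's log file is assembled from ${hadoop.log.dir}/${hadoop.log.file}, so with neither property set the substituted path collapses to a bare separator, which log4j then fails to open.

```java
// Illustration only, not the actual log4j/Nutch source. log4j.properties
// typically sets the appender file to ${hadoop.log.dir}/${hadoop.log.file};
// this mimics that substitution to show how the path degenerates when the
// two system properties are missing.
public class LogPathDemo {

    // Stand-in for the ${hadoop.log.dir}/${hadoop.log.file} substitution.
    static String resolve(String dir, String file) {
        return dir + "/" + file;
    }

    public static void main(String[] args) {
        String dir = System.getProperty("hadoop.log.dir", "");
        String file = System.getProperty("hadoop.log.file", "");
        // With both properties unset this prints a bare separator: the
        // path that log4j then fails to open for append.
        System.out.println(resolve(dir, file));
    }
}
```

Run with -Dhadoop.log.dir=logs -Dhadoop.log.file=nutch.log and resolve() yields logs/nutch.log instead of a bare separator.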
 

- Original Message - 
From: kauu [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, January 31, 2007 1:45 PM
Subject: log4j problem


Why do I get the error below after changing nutch/conf/log4j.properties?

I just changed the first line from

  log4j.rootLogger=INFO,DRFA to log4j.rootLogger=DEBUG,DRFA

Like this:

**********


# RootLogger - DailyRollingFileAppender
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG,DRFA

# Logging Threshold
log4j.threshhold=ALL

# special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout


**********

In the console, it shows me the error below:

 

 

 

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (???)
	at java.io.FileOutputStream.openAppend(Native Method)
	at java.io.FileOutputStream.<init>(Unknown Source)
	at java.io.FileOutputStream.<init>(Unknown Source)
	at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
	at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
	at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
	at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
	at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
	at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
	at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
	at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
	at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
	at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
	at org.apache.log4j.Logger.getLogger(Logger.java:104)
	at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
	at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)



Re: RSS-fecter and index individul-how can i realize this function

2007-01-31 Thread kauu

hi,
Thanks anyway, but I don't think I explained clearly enough.

What I want is for Nutch to fetch the RSS seeds at a depth of 1 only, so Nutch should just fetch some XML pages. I don't want to fetch the pages behind the items' outlinks, because there is too much spam in those pages. So I just need to parse the RSS file.

Then, when I search for a word that appears in the description tag of an item in one of the XML files, the returned hit should look like this:

  title   == one item's title
  summary == one item's description
  link    == one item's outlink

So, I don't know whether the parse-rss plugin provides this function?

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:


Hi there,

  With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink.

Cheers,
  Chris



On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:

 Thanks for your reply. Maybe I didn't explain clearly. I want to index each item as an individual page. Then, when I search for something, for example "nutch-open source", Nutch returns a hit which contains:

   title : nutch-open source
   description : nutch nutch nutch nutch nutch
   url : http://lucene.apache.org/nutch
   category : news
   author : kauu

 So, can the plugin parse-rss satisfy what I need?

 <item>
   <title>nutch--open source</title>
   <description>nutch nutch nutch nutch nutch</description>
   <link>http://lucene.apache.org/nutch</link>
   <category>news</category>
   <author>kauu</author>
 </item>



On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

 Hi there,

 I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin?

 The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something?

 Cheers,
   Chris



 On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

  Hi folks:

  What I want to do is to separate an RSS file into several pages. Just as has been discussed before, I want to fetch an RSS page and index it as different documents in the index, so the searcher can search each item's info as an individual hit.

  My idea is to create a protocol for fetching the RSS page and storing it as several pages, each of which contains just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for a document?

  So my question is how to realize this function in nutch-0.8.x.

  I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before it is stored, so it becomes not one document but several.

  So, can anyone give me some hints?

  Any reply will be appreciated!

  ITEM's structure:

  <item>
    <title>Late-arriving snowstorm hits Europe, causing flight delays and traffic chaos (photos)</title>
    <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. The same day, workers cleared snow from the runway at Munich airport in southern Germany. Reportedly, the late-arriving snowstorm swept across... for two consecutive days.</description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>Sohu Focus Photo News</category>
    <author>[EMAIL PROTECTED]</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
  </item>
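The split being asked for here, one feed becoming several documents keyed by each item's link, might look roughly like this (simplified stand-in types invented for illustration, not Nutch code):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified sketch: split a feed's items into separate "documents",
// keyed by each item's link rather than by the feed URL. Types invented.
public class FeedSplitSketch {
    record Item(String title, String link) {}

    static Map<String, Item> splitByLink(List<Item> feedItems) {
        Map<String, Item> docs = new LinkedHashMap<>();
        for (Item item : feedItems) {
            // The item's link, not the feed URL, is the unique document key.
            docs.put(item.link(), item);
        }
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Item> docs = splitByLink(List.of(
            new Item("storm news", "http://news.sohu.com/20070125/n247833568.shtml"),
            new Item("nutch-open source", "http://lucene.apache.org/nutch")));
        System.out.println(docs.keySet());
    }
}
```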
 
 

 





--
www.babatu.com


Re: log4j problem

2007-01-31 Thread kauu

Sorry, I will be careful. Thanks anyway.

On 1/31/07, chee wu [EMAIL PROTECTED] wrote:


Setting the two Java arguments -Dhadoop.log.file and -Dhadoop.log.dir should fix your problem.
BTW, please don't put so many Chinese characters in your mail.





--
www.babatu.com


Can't Compile Revision 501954

2007-01-31 Thread Tobias Zahn
Hello!
Does anybody know why I get the following error when running ant on the revision I checked out from svn? Maybe it's a dumb question, but...

Thank you for your help!

compile:
 [echo] Compiling plugin: parse-html
[javac] Compiling 5 source files to nutch/trunk/build/parse-html/classes
[javac]
nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java:102:
cannot access org.apache.nutch.parse.HtmlParseFilters
[javac] bad class file:
nutch/trunk/build/classes/org/apache/nutch/parse/HtmlParseFilters.class
[javac] illegal start of class file
[javac] Please remove or make sure it appears in the correct
subdirectory of the classpath.
[javac]   private HtmlParseFilters htmlParseFilters;
[javac]   ^
[javac] 1 error


RE: RSS-fecter and index individul-how can i realize this function

2007-01-31 Thread Gal Nitzan
Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give users concentrated data, and so forth.

Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page on the site.

IMHO, the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe we could add a new flag to CrawlDatum that marks a URL as parsable but not fetchable?
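Purely as a sketch of that idea (all names below are invented for illustration; the real CrawlDatum stores its state as byte status constants, and this is not a patch against it):

```java
// Illustration of the idea above: a CrawlDatum-style status byte with a
// hypothetical flag marking a URL as "parse the stored payload, don't
// fetch it". All names are invented, not actual Nutch constants.
public class ParsableFlagSketch {
    static final byte STATUS_UNFETCHED = 0x01;
    static final byte STATUS_PARSABLE  = 0x40; // hypothetical new flag

    // The fetch phase would skip entries flagged parsable-only and hand
    // them straight to the parser instead.
    static boolean shouldFetch(byte status) {
        return (status & STATUS_PARSABLE) == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldFetch(STATUS_UNFETCHED));             // prints true
        System.out.println(shouldFetch((byte) (STATUS_PARSABLE | 1))); // prints false
    }
}
```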

Just my two cents...

Gal.

-Original Message-
From: Chris Mattmann [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, January 31, 2007 8:44 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS-fecter and index individul-how can i realize this function

Hi there,

  With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink.
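That item-to-Outlink flow can be sketched roughly as follows (the RssItem/Outlink types here are simplified stand-ins for the real parse-rss and Nutch classes, which carry more fields and arguments):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the real Nutch/parse-rss types, for
// illustration only.
public class RssOutlinkSketch {
    record RssItem(String title, String description, String link) {}
    record Outlink(String toUrl, String anchor) {}

    // Each channel item contributes its link as an Outlink, so the
    // fetcher later queues and processes the item's own page as well.
    static List<Outlink> itemsToOutlinks(List<RssItem> items) {
        List<Outlink> outlinks = new ArrayList<>();
        for (RssItem item : items) {
            outlinks.add(new Outlink(item.link(), item.title()));
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<RssItem> items = List.of(
            new RssItem("nutch-open source", "nutch nutch",
                        "http://lucene.apache.org/nutch"));
        System.out.println(itemsToOutlinks(items));
    }
}
```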

Cheers,
  Chris


