RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread kauu
Hi folks,

What I want to do is split an RSS file into several pages.

As has been discussed before, I want to fetch an RSS page and index it as
separate documents in the index, so that a search returns each item's info
as an individual hit.

My idea is to create a protocol plugin that fetches the RSS page and stores
it as several pages, each containing just one ITEM tag. But the unique key
is the URL, so how can I store them with each ITEM's link tag as the unique
key for a document?

So my question is how to implement this in nutch-0.8.x.

I've checked the code of the protocol-http plugin, but I can't find the code
that stores a page as a document. I want to split the RSS page into several
pages before it is stored, so that it ends up as several documents rather
than one.

Can anyone give me some hints?

Any reply will be appreciated!

 

 

ITEM's structure:

  <item>
    <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
    <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。  据报道,迟来的暴风雪连续两天横扫中...</description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>搜狐焦点图新闻</category>
    <author>[EMAIL PROTECTED]</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
  </item>
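
The goal described above (one index document per item, keyed by the item's link) can be sketched outside Nutch. This is a minimal Python illustration using only the standard library; it is not Nutch plugin code. The function name `split_feed`, the `FIELDS` tuple and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; element names follow RSS 2.0.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example channel</title>
    <item>
      <title>first item</title>
      <description>some text</description>
      <link>http://example.org/first</link>
      <category>news</category>
      <author>someone</author>
    </item>
    <item>
      <title>second item</title>
      <link>http://example.org/second</link>
    </item>
  </channel>
</rss>"""

# Item fields copied into each per-item document (an illustrative choice).
FIELDS = ("title", "description", "category", "author", "pubDate")


def split_feed(feed_xml):
    """Return {item link: {field: text}}, one entry per <item>."""
    root = ET.fromstring(feed_xml)
    docs = {}
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # no link means no unique key, so skip the item
        docs[link] = {
            name: (item.findtext(name) or "").strip()
            for name in FIELDS
            if item.find(name) is not None
        }
    return docs


if __name__ == "__main__":
    for url, fields in split_feed(SAMPLE_FEED).items():
        print(url, fields)
```

Each dictionary entry plays the role of one searchable document, with the item's link as its key instead of the feed URL.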

 



Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

 I could most likely be of assistance, if you gave me some more information.
For instance: I'm wondering if the use case you describe below is already
supported by the current RSS parse plugin?

 The current RSS parser, parse-rss, does in fact index individual items that
are pointed to by an RSS document. The items are added as Nutch Outlinks,
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?

Cheers,
  Chris







Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread kauu

Thanks for your reply.
Maybe I didn't explain it clearly.
I want to index each item as an individual page. Then when I search for
something, for example "nutch-open source", Nutch returns a hit which
contains:

  title : nutch-open source
  description : nutch nutch nutch nutch  nutch
  url : http://lucene.apache.org/nutch
  category : news
  author : kauu

So, can the parse-rss plugin satisfy what I need?

  <item>
    <title>nutch--open source</title>
    <description>nutch nutch nutch nutch  nutch</description>
    <link>http://lucene.apache.org/nutch</link>
    <category>news</category>
    <author>kauu</author>
  </item>





--
www.babatu.com


Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

> Chris,
>
> I saw your name associated with the rss parser in nutch.  My understanding is
> that nutch is using feedparser.  I had two questions:
>
> 1.  Have you looked at vtd as an rss parser?

I haven't in fact; what are its benefits over those of commons-feedparser?

> 2.  Any view on asynchronous communication as the underlying protocol?  I do
> not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication when parsing rss feeds
affords you: what type of communications are you talking about above? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's rss parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom/Feed XML documents. When you mention asynchronous above,
are you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
  Chris


 
 Thanks
   
 




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread pdecrem

1.   It (vtd) claims to be faster.
2.   Asynchronous communication should avoid sitting and waiting for one
fetch to return before you start the next.

P.S. I am not sure whether you have checked out tailrank.com for that branch
of feedparser (I think it's at code.tailrank.com/feedparser).
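
To illustrate what "not waiting for one fetch before starting the next" means, here is a generic Python sketch using `asyncio`. It does not show how Nutch or feedparser fetch content; the `fetch` coroutine only simulates network latency with a sleep, and the URLs are placeholders.

```python
import asyncio
import time


async def fetch(url, delay):
    """Stand-in for a network fetch: waits `delay` seconds, returns a label."""
    await asyncio.sleep(delay)
    return f"content of {url}"


async def fetch_all(urls, delay=0.2):
    # gather() starts every fetch before waiting on any of them, so the
    # total time is close to one delay rather than len(urls) delays.
    return await asyncio.gather(*(fetch(u, delay) for u in urls))


if __name__ == "__main__":
    feeds = [
        "http://example.org/a.rss",
        "http://example.org/b.rss",
        "http://example.org/c.rss",
    ]
    start = time.monotonic()
    pages = asyncio.run(fetch_all(feeds))
    print(pages)
    print(f"elapsed: {time.monotonic() - start:.2f}s")
```

Results come back in the same order as the input URLs, even though the simulated fetches overlap in time.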

Thanks


  





log4j problem

2007-01-30 Thread kauu
Why do I get an error after changing nutch/conf/log4j.properties?

I only changed the first line, from log4j.rootLogger=INFO,DRFA to
log4j.rootLogger=DEBUG,DRFA.

Like this:


# RootLogger - DailyRollingFileAppender
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG,DRFA

# Logging Threshold
log4j.threshhold=ALL

#special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout
...

The console then shows the following error:

 

 

 

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (系统找不到指定的路径。)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
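
One possible reading of this trace: the file path handed to the DRFA appender resolved to a bare path separator (`\`; the localized message says the system cannot find the path specified). That can happen when the appender's File setting is built from system properties that are not defined at startup, for example when the class is launched directly rather than through a script that passes them with `-D`. The Python sketch below mimics that substitution. The setting `${hadoop.log.dir}/${hadoop.log.file}` is an assumption, because the appender section of the file is elided in the post, and `subst_vars` is an illustrative helper, not log4j code.

```python
import re

_VAR = re.compile(r"\$\{([^}]+)\}")


def subst_vars(value, props):
    """Replace ${key} with props[key]; undefined keys become ''.

    This mirrors how log4j 1.x is understood to treat undefined
    variables. It is an illustration, not log4j's implementation.
    """
    return _VAR.sub(lambda m: props.get(m.group(1), ""), value)


# Assumed appender setting (the real line is elided in the post above).
FILE_SETTING = "${hadoop.log.dir}/${hadoop.log.file}"

if __name__ == "__main__":
    # Properties undefined: the path collapses to a bare separator.
    print(repr(subst_vars(FILE_SETTING, {})))
    # Properties defined, as a launcher script would normally arrange.
    print(subst_vars(FILE_SETTING, {
        "hadoop.log.dir": "logs",
        "hadoop.log.file": "hadoop.log",
    }))
```

If this is the cause, defining both properties on the Java command line, or pointing the appender at a fixed file name, should make the error go away; the change of log level itself would not be the trigger.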



Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS file,
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do, is to allow you to associate the metadata fields
category:, and author: with the item Outlink...
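
To make the distinction concrete, here is a rough Python model of the behaviour described above. It does not use Nutch's classes: `Outlink` is a hypothetical named tuple holding only a URL and anchor text, using the item title as the anchor is an assumption, and `parse_channel` is an illustrative name. The side table keyed by URL shows the per-item association that is described as missing.

```python
from collections import namedtuple
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an outlink: a target URL plus anchor text only.
Outlink = namedtuple("Outlink", ["url", "anchor"])

# Feed built from the example item earlier in the thread.
FEED = """<rss version="2.0"><channel>
  <title>Example channel</title>
  <item>
    <title>nutch--open source</title>
    <link>http://lucene.apache.org/nutch</link>
    <category>news</category>
    <author>kauu</author>
  </item>
</channel></rss>"""


def parse_channel(feed_xml):
    """Return (outlinks, item_metadata) for one feed."""
    root = ET.fromstring(feed_xml)
    outlinks, metadata = [], {}
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if not url:
            continue
        title = (item.findtext("title") or "").strip()
        outlinks.append(Outlink(url, title))
        # Per-item fields tied to the outlink URL (the missing piece).
        metadata[url] = {
            key: (item.findtext(key) or "").strip()
            for key in ("category", "author")
            if item.find(key) is not None
        }
    return outlinks, metadata


if __name__ == "__main__":
    links, meta = parse_channel(FEED)
    print(links)
    print(meta)
```

An outlink alone carries no category or author, so those fields would have to travel alongside it, keyed by the item URL, to show up in the hit for that item.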

Cheers,
  Chris


