RSS fetcher and indexing items individually: how can I realize this function
Hi folks,

What I want to do is split an RSS file into several pages, as has been discussed here before. I want to fetch an RSS page and index each item as a separate document in the index, so that a searcher gets each item's information as an individual hit.

My idea is to create a protocol plugin that fetches the RSS page and stores it as several pages, each containing just one ITEM tag. But the unique key of a document is the URL, so how can I store each of them with the ITEM's link tag as the document's unique key?

So my question is how to realize this in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find where a page gets stored as a document. I want to split the RSS page before it is stored, so that it ends up as several documents rather than one.

Can anyone give me some hints? Any reply will be appreciated!

ITEM's structure:

  <item>
    <title>Late blizzard strikes Europe, causing flight delays and traffic chaos (photos)</title>
    <description>A blizzard swept across Europe, causing numerous flight delays. On January 24, several airliners waited at Stuttgart Airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from the runways at Munich Airport in southern Germany. According to reports, the late blizzard swept for two consecutive days across...</description>
    <link>http://news.sohu.com/20070125/n247833568.shtml</link>
    <category>Sohu Focus Photo News</category>
    <author>[EMAIL PROTECTED]</author>
    <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
    <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
  </item>
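As a rough illustration of the splitting step described above, here is a minimal sketch that uses only the JDK's DOM parser. It is not Nutch code and does not touch Nutch's protocol or parse plugin APIs; the class name RssItemSplitter and the sample feed are hypothetical. It turns one RSS document into one field map per item, keyed by the item's link, which is the role the unique key would play in the index.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not part of Nutch): splits an RSS document into per-item records. */
final class RssItemSplitter {

    /** Returns one field map per item element, keyed by the text of the item's link element. */
    static Map<String, Map<String, String>> split(String rssXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse RSS document", e);
        }
        Map<String, Map<String, String>> byLink = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            // Copy every child element (title, description, link, ...) into a field map.
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList children = items.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            // Items without a link cannot be keyed, so they are skipped in this sketch.
            String link = fields.get("link");
            if (link != null && !link.isEmpty()) {
                byLink.put(link, fields);
            }
        }
        return byLink;
    }

    public static void main(String[] args) {
        // Hypothetical two-item feed used only for demonstration.
        String rss = "<rss version=\"2.0\"><channel><title>demo</title>"
                + "<item><title>A</title><link>http://example.org/a</link></item>"
                + "<item><title>B</title><link>http://example.org/b</link></item>"
                + "</channel></rss>";
        split(rss).forEach((link, fields) -> System.out.println(link + " -> " + fields));
    }
}
```

How such per-item records would then be handed to Nutch's storage and indexing stages is exactly the open question of this message, so the sketch stops at the in-memory map.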
Re: RSS fetcher and indexing items individually: how can I realize this function
Hi there,

I could most likely be of assistance if you gave me some more information. For instance, I'm wondering whether the use case you describe is already supported by the current RSS parse plugin. The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you describe? Or am I missing something?

Cheers,
Chris
Re: RSS fetcher and indexing items individually: how can I realize this function
Thanks for your reply. Maybe I didn't explain it clearly. I want to index each item as an individual page. Then, when I search for something, for example "nutch-open source", Nutch returns a hit that contains:

  title       : nutch-open source
  description : nutch nutch nutch nutch nutch
  url         : http://lucene.apache.org/nutch
  category    : news
  author      : kauu

So, can the parse-rss plugin satisfy what I need? The item in the feed would be:

  <item>
    <title>nutch--open source</title>
    <description>nutch nutch nutch nutch nutch</description>
    <link>http://lucene.apache.org/nutch</link>
    <category>news</category>
    <author>kauu</author>
  </item>

--
www.babatu.com
Re: RSS fetcher and indexing items individually: how can I realize this function
Hi there,

On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:

> Chris,
> I saw your name associated with the rss parser in nutch. My understanding is that nutch is using feedparser. I had two questions:
> 1. Have you looked at vtd as an rss parser?

I haven't, in fact; what are its benefits over those of commons-feedparser?

> 2. Any view on asynchronous communication as the underlying protocol? I do not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication affords you when parsing RSS feeds: what type of communication are you talking about? Nutch handles the communications layer for fetching content using a pluggable, Protocol-based model. The only features that Nutch's RSS parser uses from the underlying feedparser library are its object model and its callback framework for parsing RSS/Atom/Feed XML documents. When you mention asynchronous, are you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
Chris
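To make the idea of a callback framework for parsing feed XML concrete, here is a minimal sketch that uses the JDK's SAX API rather than commons-feedparser, whose API is not shown in this thread. The class name ItemLinkCollector is hypothetical. The parser calls back into the handler for each element, and the handler collects the link of every item, which is roughly the information a parser would need in order to emit outlinks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX handler (not feedparser, not Nutch): collects the link of each item. */
final class ItemLinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inItem;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("item".equals(qName)) {
            inItem = true;
        }
        text.setLength(0); // start collecting text for the element that just opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only links inside an item count; the channel-level link is ignored.
        if (inItem && "link".equals(qName)) {
            links.add(text.toString().trim());
        }
        if ("item".equals(qName)) {
            inItem = false;
        }
    }

    /** Parses the given RSS text and returns the item links in document order. */
    static List<String> collect(String rssXml) {
        ItemLinkCollector handler = new ItemLinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(rssXml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse RSS document", e);
        }
        return handler.links;
    }
}
```

A call such as ItemLinkCollector.collect(rssText) returns the item links without building a full document tree, which is the usual reason to prefer callbacks for large feeds.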
Re: RSS fetcher and indexing items individually: how can I realize this function
1. It claims to be faster.
2. Asynchronous communication should take care of sitting and waiting for one fetch to return before you start the next.

PS: I am not sure whether you have checked out tailrank.com for that branch of feedparser (I think it's at code.tailrank.com/feedparser).

Thanks
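The point about not waiting for one fetch to return before starting the next can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's CompletableFuture on a thread pool rather than non-blocking I/O, it has nothing to do with how Nutch or feedparser actually fetch, and the class name ConcurrentFetchSketch and the fetcher function are hypothetical stand-ins for a real HTTP client.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch: start all fetches up front instead of one after another. */
final class ConcurrentFetchSketch {

    /**
     * Applies the fetcher to every URL concurrently and returns the results
     * in the same order as the input list.
     */
    static List<String> fetchAll(List<String> urls, Function<String, String> fetcher) {
        // Submit every fetch first; collecting to a list forces all of them to start.
        List<CompletableFuture<String>> pending = urls.stream()
                .map(url -> CompletableFuture.supplyAsync(() -> fetcher.apply(url)))
                .collect(Collectors.toList());
        // Only now wait for the results, one future at a time.
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The lambda stands in for a real network call.
        List<String> bodies = fetchAll(
                Arrays.asList("http://example.org/a.rss", "http://example.org/b.rss"),
                url -> "body of " + url);
        bodies.forEach(System.out::println);
    }
}
```

With a slow fetcher, the total time of fetchAll approaches that of the slowest single fetch rather than the sum of all of them, as long as the pool has enough threads.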
log4j problem
Why do I get the error below after changing nutch/conf/log4j.properties? I only changed the first line, from

  log4j.rootLogger=INFO,DRFA

to

  log4j.rootLogger=DEBUG,DRFA

so that the file now looks like this:

  # RootLogger - DailyRollingFileAppender
  #log4j.rootLogger=INFO,DRFA
  log4j.rootLogger=DEBUG,DRFA

  # Logging Threshold
  log4j.threshhold=ALL

  #special logging requirements for some commandline tools
  log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
  log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
  log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
  log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
  log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
  log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
  log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
  log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
  log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout
  ...

The console then shows the following error:

  log4j:ERROR setFile(null,true) call failed.
  java.io.FileNotFoundException: \ (The system cannot find the path specified.)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
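One possible explanation, offered as a guess because the appender section of the file is not shown: if the DRFA appender's File property is built from placeholders such as ${hadoop.log.dir}/${hadoop.log.file} (an assumption about this configuration), and neither value is defined when the JVM starts, the path may collapse to a bare separator, which would match the "\" in the FileNotFoundException. The sketch below only imitates that substitution with plain JDK classes. It is not log4j code, the class name LogPathSketch is hypothetical, and treating an undefined placeholder as an empty string is my understanding of log4j 1.2's behaviour rather than something confirmed in this thread.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of ${name} substitution where undefined names become empty. */
final class LogPathSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} in the template with its value from props, or "" if unset. */
    static String resolve(String template, Properties props) {
        Matcher m = VAR.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1), ""); // unset -> empty string
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "${hadoop.log.dir}/${hadoop.log.file}"; // assumed File setting
        // Nothing defined: the path is just the separator.
        System.out.println(resolve(template, new Properties()));
        // Both values defined: a usable log file path.
        Properties defined = new Properties();
        defined.setProperty("hadoop.log.dir", "logs");
        defined.setProperty("hadoop.log.file", "hadoop.log");
        System.out.println(resolve(template, defined));
    }
}
```

If this is indeed the cause, defining both values before log4j initializes, for example as JVM system properties, should give the appender a real file to open. It would also mean the error is unrelated to switching the root logger from INFO to DEBUG.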
Re: RSS fetcher and indexing items individually: how can I realize this function
Hi there,

With the explanation that you give, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds the URL of each item in the channel as an Outlink, so that Nutch will process those pieces of content as well. The only thing you suggest that parse-rss currently doesn't do is let you associate the metadata fields category: and author: with the item Outlink...

Cheers,
Chris