Re: log4j problem
Setting the two Java arguments -Dhadoop.log.file and -Dhadoop.log.dir should fix your problem. BTW, please don't put too many Chinese characters in your mail..

----- Original Message -----
From: kauu [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, January 31, 2007 1:45 PM
Subject: log4j problem

Why did this happen when I changed nutch/conf/log4j.properties? I only changed the first line from

    log4j.rootLogger=INFO,DRFA

to

    log4j.rootLogger=DEBUG,DRFA

like this:

    # RootLogger - DailyRollingFileAppender
    #log4j.rootLogger=INFO,DRFA
    log4j.rootLogger=DEBUG,DRFA

    # Logging Threshold
    log4j.threshhold=ALL

    # special logging requirements for some commandline tools
    log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
    log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
    log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
    log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
    log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
    log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
    log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
    log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
    log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout

In the console, it shows me the error below:

    log4j:ERROR setFile(null,true) call failed.
    java.io.FileNotFoundException: \ (???)
    at java.io.FileOutputStream.openAppend(Native Method)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
    at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
    at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
    at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
    at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
    at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
    at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
    at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
    at org.apache.log4j.Logger.getLogger(Logger.java:104)
    at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
    at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
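The setFile(null,true) failure above means the DailyRollingFileAppender (DRFA) was activated without a usable file path. In the stock Nutch log4j.properties the appender's output file is assembled from the hadoop.log.dir and hadoop.log.file system properties, roughly like this (lines assumed from the shipped config; check your copy for the exact names):

```properties
# conf/log4j.properties -- the DRFA appender writes to a file built
# from two system properties that must be set on the JVM command line
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
```

If neither property is set, the substitution yields an unusable path and setFile fails, which is why passing, for example, -Dhadoop.log.dir=./logs -Dhadoop.log.file=hadoop.log (illustrative values) to the JVM fixes the error.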
Re: RSS fetcher and index individual items - how can I realize this function
hi, thanks anyway, but I don't think I explained it clearly enough. What I want is for Nutch to fetch the RSS seeds to a depth of 1 only, so Nutch should fetch just those XML pages. I don't want to fetch the pages behind the items' outlinks, because there is too much spam in those pages. So I just need to parse the RSS file itself. Then, when I search for words that appear in the description tag of one XML's items, the returned hit should look like this:

    title   == one item's title
    summary == one item's description
    link    == one item's outlink

So, I don't know whether the parse-rss plugin provides this function?

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item's URL (in the channel) as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink...

Cheers,
Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:

Thanks for your reply. Maybe I didn't explain it clearly. I want to index each item as an individual page. Then, when I search for something, for example "nutch-open source", Nutch should return a hit which contains:

    title       : nutch-open source
    description : nutch nutch nutch nutch nutch
    url         : http://lucene.apache.org/nutch
    category    : news
    author      : kauu

So, can the plugin parse-rss satisfy what I need?

    <item>
      <title>nutch--open source</title>
      <description>nutch nutch nutch nutch nutch</description>
      <link>http://lucene.apache.org/nutch</link>
      <category>news</category>
      <author>kauu</author>
    </item>

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there, I could most likely be of assistance, if you gave me some more information.
For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something?

Cheers,
Chris

On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

Hi folks:

What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as different documents in the index, so the searcher can find an item's info as an individual hit.

My idea is to create a protocol for fetching the RSS page and storing it as several pages, each containing just one ITEM tag. But the unique key is the URL, so how can I store them with the ITEM's link tag as the unique key for each document?

So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to split the RSS page into several pieces before storing, so it is stored not as one document but as several. Can anyone give me some hints? Any reply will be appreciated!

An ITEM's structure (Chinese content translated):

    <item>
      <title>Europe's late snowstorm strikes, causing flight delays and traffic chaos (photos)</title>
      <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from the runway at Munich airport in southern Germany. According to reports, the late-arriving snowstorm swept for two straight days across central...</description>
      <link>http://news.sohu.com/20070125/n247833568.shtml</link>
      <category>Sohu focus picture news</category>
      <author>[EMAIL PROTECTED]</author>
      <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
      <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
    </item>

--
www.babatu.com
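What kauu is asking for can be sketched outside Nutch with the JDK's built-in DOM parser: treat each <item> in the feed as its own document, keyed by its <link>, with title/description/category/author as indexed fields. This is an illustrative standalone sketch (class and field names are made up here, and it is not the parse-rss plugin's actual code):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class RssItemExtract {
    public static void main(String[] args) throws Exception {
        // A minimal RSS 2.0 feed, mirroring kauu's example item.
        String rss =
            "<rss version=\"2.0\"><channel>" +
            "<item>" +
            "<title>nutch--open source</title>" +
            "<description>nutch nutch nutch nutch nutch</description>" +
            "<link>http://lucene.apache.org/nutch</link>" +
            "<category>news</category>" +
            "<author>kauu</author>" +
            "</item>" +
            "</channel></rss>";

        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(rss.getBytes(StandardCharsets.UTF_8)));

        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            // Each <item> becomes its own "document": the <link> serves as
            // the unique key, the remaining tags as indexed fields.
            System.out.println("url=" + text(item, "link"));
            System.out.println("title=" + text(item, "title"));
            System.out.println("description=" + text(item, "description"));
            System.out.println("category=" + text(item, "category"));
            System.out.println("author=" + text(item, "author"));
        }
    }

    // Text content of the first child element with the given tag, or "".
    static String text(Element parent, String tag) {
        NodeList nl = parent.getElementsByTagName(tag);
        return nl.getLength() > 0 ? nl.item(0).getTextContent().trim() : "";
    }
}
```

Hooking this mapping into Nutch would still require emitting one ParseData/Document per item instead of one per URL, which is exactly the part the stock parse-rss plugin does not do.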
Re: log4j problem
Sorry, I will be careful. Thanks anyway.

On 1/31/07, chee wu [EMAIL PROTECTED] wrote:

Setting the two Java arguments -Dhadoop.log.file and -Dhadoop.log.dir should fix your problem. BTW, please don't put too many Chinese characters in your mail.. [...]

--
www.babatu.com
Can't Compile Revision 501954
hello! Does anybody know why I get the following error running ant on the revision I have checked out from svn? Maybe it's a dumb question but... Thank you for your help!

    compile:
        [echo] Compiling plugin: parse-html
       [javac] Compiling 5 source files to nutch/trunk/build/parse-html/classes
       [javac] nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java:102: cannot access org.apache.nutch.parse.HtmlParseFilters
       [javac] bad class file: nutch/trunk/build/classes/org/apache/nutch/parse/HtmlParseFilters.class
       [javac] illegal start of class file
       [javac] Please remove or make sure it appears in the correct subdirectory of the classpath.
       [javac]     private HtmlParseFilters htmlParseFilters;
       [javac]             ^
       [javac] 1 error
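A "bad class file ... illegal start of class file" message usually points at a stale or truncated .class file left under build/ by an earlier (possibly interrupted) build, rather than at the source checkout itself. A clean rebuild is a reasonable first thing to try (paths assume a trunk checkout):

```shell
cd nutch/trunk
# Remove previously compiled classes, including the suspect
# build/classes/org/apache/nutch/parse/HtmlParseFilters.class
ant clean
# Recompile everything from source
ant
```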
RE: RSS fetcher and index individual items - how can I realize this function
Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give users concentrated data, and so forth. Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page on the site.

IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would mark the URL as parsable, not fetchable? Just my two cents...

Gal.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 31, 2007 8:44 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS fetcher and index individual items - how can I realize this function

Hi there,

With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. [...]

--
www.babatu.com