Re: RSS fetcher and index individual items - how can I realize this function?
I've changed the code as you said, but I get an exception like this. Why? Is the exception from the MD5Signature class?

2007-02-05 11:28:38,453 WARN feedparser.FeedFilter (FeedFilter.java:doDecodeEntities(223)) - Filter encountered unknown entities
2007-02-05 11:28:39,390 INFO crawl.SignatureFactory (SignatureFactory.java:getSignature(45)) - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-05 11:28:40,078 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(120)) - job_f6j55m
java.lang.NullPointerException
        at org.apache.nutch.parse.ParseOutputFormat$1.write(ParseOutputFormat.java:121)
        at org.apache.nutch.fetcher.FetcherOutputFormat$1.write(FetcherOutputFormat.java:87)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:235)
        at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:247)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

On 2/3/07, Renaud Richardet [EMAIL PROTECTED] wrote:

Gal, Chris, Kauu,

So, if I understand correctly, you need a way to pass information along with the fetches, so that when Nutch fetches a feed entry, the item value fetched previously is available. This is how I tackled the issue:

- extend Outlink.java to allow creating outlinks with more metadata; in your feed parser, create outlinks this way
- pass the metadata on through ParseOutputFormat.java and Fetcher.java
- retrieve the metadata in HtmlParser.java and use it

This is very tedious, it will bloat the size of your outlinks db, it requires changes in the core code of Nutch, etc. But this is the only way I came up with...
If someone sees a better way, please let me know :-)

Sample code, for Nutch 0.8.x:

Outlink.java:

+  public Outlink(String toUrl, String anchor, String entryContents,
+                 Configuration conf) throws MalformedURLException {
+    this.toUrl = new UrlNormalizerFactory(conf).getNormalizer().normalize(toUrl);
+    this.anchor = anchor;
+    this.entryContents = entryContents;
+  }

and update the other methods accordingly.

ParseOutputFormat.java, around line 140:

+  // set outlink info in metadata
+  String entryContents = links[i].getEntryContents();
+  if (entryContents.length() > 0) { // it's a feed entry
+    MapWritable meta = new MapWritable();
+    meta.put(new UTF8("entryContents"), new UTF8(entryContents)); // key/value
+    target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
+    target.setMetaData(meta);
+  } else {
+    target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval); // no meta
+  }

Fetcher.java, around line 266:

+  // add feed info to metadata
+  try {
+    String entryContents = datum.getMetaData().get(new UTF8("entryContents")).toString();
+    metadata.set("entryContents", entryContents);
+  } catch (Exception e) { } // not found

HtmlParser.java:

// get entry metadata
String entryContents = content.getMetadata().get("entryContents");

HTH,
Renaud

Gal Nitzan wrote:

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. The links are indeed extracted, and each item link will be fetched in the next fetch as a separate page, ending up as one Lucene document.

IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource, and there is no reason to go to the server and re-fetch that resource.

Another issue that arises from RSS feeds is that once the feed page is fetched, you cannot re-fetch it until its fetch time has expired. A feed's TTL is usually very short.
Since for now in Nutch all pages are created equal :) it is one more thing to think about.

HTH, Gal.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS fetcher and index individual items - how can I realize this function?

Hi Gal, et al.,

I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before, and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;)

Okay, so you mention below that the thing you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum and parse it in the next fetch phase. Well, there are two options here for what you refer to as "it":

1. If you're talking about the RSS file, then in fact it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed
Re: RSS fetcher and index individual items - how can I realize this function?
hi all,

What Gal said is exactly what I mean about the rss-parse need. I just want to fetch the rss seeds once.

On 2/2/07, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi Chris,

I'm sorry I wasn't clear enough. What I mean is that in the current implementation:

1. The RSS (channels, items) page ends up as one Lucene document in the index.
2. The links are indeed extracted, and each item link will be fetched in the next fetch as a separate page, ending up as one Lucene document.

IMHO the data that is needed, i.e. the data that will be fetched in the next fetch process, is already available in the item element. Each item element represents one web resource, and there is no reason to go to the server and re-fetch that resource.

Another issue that arises from RSS feeds is that once the feed page is fetched, you cannot re-fetch it until its fetch time has expired. A feed's TTL is usually very short. Since for now in Nutch all pages are created equal :) it is one more thing to think about.

HTH, Gal.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2007 7:01 PM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS fetcher and index individual items - how can I realize this function?

Hi Gal, et al.,

I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before, and it seems that we keep talking around each other. I'd like to get to the heart of this matter so that the issue (if there is an actual one) gets addressed ;)

Okay, so you mention below that the thing you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum and parse it in the next fetch phase. Well, there are two options here for what you refer to as "it":

1. If you're talking about the RSS file, then in fact it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed and indexed.

2.
If you're talking about the item links within the RSS file, then in fact they are parsed (eventually), and their data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file and make sure that we index their content as well.

Thus, if you had an RSS file R that contained links to a PDF file A and another HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file) would additionally match things such as P and A, and all three would be capable of being returned in a Nutch query.

Does this make sense? Is this the issue that you're talking about? Am I nuts? ;)

Cheers,
Chris

On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi,

Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data, and so forth. Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page in the site.

IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would mark the URL as parsable, not fetchable?

Just my two cents...

Gal.

-----Original Message-----
From: Chris Mattmann [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, January 31, 2007 8:44 AM
To: nutch-dev@lucene.apache.org
Subject: Re: RSS fetcher and index individual items - how can I realize this function?

Hi there,

With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do.
parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink...

Cheers,
Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:

Thanks for your reply. Maybe I didn't explain it clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:

title : nutch-open source
description : nutch nutch nutch nutch nutch
url : http://lucene.apache.org/nutch
category : news
author : kauu

So, can the parse-rss plugin satisfy what I need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
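The item-to-Outlink step Chris describes can be illustrated with a small, self-contained sketch. This is not the actual parse-rss implementation (which is built on commons-feedparser); it only shows the idea with plain JDK XML parsing: walk the channel's item elements and collect each item's link as an outlink URL to be queued for fetching.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssOutlinkSketch {

    // Collect each <item>'s <link> URL, the way parse-rss conceptually
    // turns channel items into Outlinks for the fetch queue.
    static List<String> itemLinks(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes("UTF-8")));
        List<String> links = new ArrayList<String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList link = item.getElementsByTagName("link");
            if (link.getLength() > 0) {
                links.add(link.item(0).getTextContent().trim());
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss><channel>"
                + "<item><title>nutch--open source</title>"
                + "<link>http://lucene.apache.org/nutch</link></item>"
                + "</channel></rss>";
        for (String url : itemLinks(rss)) {
            System.out.println(url);
        }
    }
}
```

In the real plugin each collected URL would become a Nutch Outlink, so a feed R linking to pages A and P causes A and P to be fetched and indexed in the following round, as described above.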
Re: RSS fetcher and index individual items - how can I realize this function?
hi,

Thanks anyway, but I don't think I explained it clearly enough. What I want is for Nutch to fetch the rss seeds to a depth of 1 only, so Nutch should fetch just some XML pages. I don't want to fetch the pages behind the items' outlinks, because there is too much spam in those pages. So I just need to parse the rss file. Then, when I search for some words that appear in a description tag of one XML's item, the returned hit should look like this:

title   == one item's title
summary == one item's description
link    == one item's outlink

So, I don't know whether the parse-rss plugin provides this function?

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink...

Cheers,
Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote:

Thanks for your reply. Maybe I didn't explain it clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:

title : nutch-open source
description : nutch nutch nutch nutch nutch
url : http://lucene.apache.org/nutch
category : news
author : kauu

So, can the parse-rss plugin satisfy what I need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

I could most likely be of assistance, if you gave me some more information.
For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something?

Cheers,
Chris

On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

Hi folks:

What I want to do is to separate an rss file into several pages, just as has been discussed before. I want to fetch an rss page and index it as different documents in the index, so the searcher can search an item's info as an individual hit.

My opinion: create a protocol to fetch the rss page and store it as several copies, each containing just one item tag. But the unique key is the URL, so how can I store them with the item's link tag as the unique key for a document?

So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the rss page into several ones before storing it, not as one document but as several. So can anyone give me some hints? Any reply will be appreciated!

The item's structure:

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

--
www.babatu.com
Re: log4j problem
sorry, I will be careful. thanks anyway

On 1/31/07, chee wu [EMAIL PROTECTED] wrote:

Setting the two Java arguments -Dhadoop.log.file and -Dhadoop.log.dir should fix your problem. By the way, don't put too many Chinese characters in your mail...

----- Original Message -----
From: kauu [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Wednesday, January 31, 2007 1:45 PM
Subject: log4j problem

Why, when I change nutch/conf/log4j.properties (I only changed the first line from log4j.rootLogger=INFO,DRFA to log4j.rootLogger=DEBUG,DRFA), like this:

# RootLogger - DailyRollingFileAppender
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG,DRFA

# Logging Threshold
log4j.threshhold=ALL

# special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout

does the console show me the error below?

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified.)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(Unknown Source)
        at java.io.FileOutputStream.<init>(Unknown Source)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
        at org.apache.log4j.Logger.getLogger(Logger.java:104)
        at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
        at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

--
www.babatu.com
RSS fetcher and index individual items - how can I realize this function?
Hi folks:

What I want to do is to separate an rss file into several pages, just as has been discussed before. I want to fetch an rss page and index it as different documents in the index, so the searcher can search an item's info as an individual hit.

My opinion: create a protocol to fetch the rss page and store it as several copies, each containing just one item tag. But the unique key is the URL, so how can I store them with the item's link tag as the unique key for a document?

So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the rss page into several ones before storing it, not as one document but as several. So can anyone give me some hints? Any reply will be appreciated!

The item's structure:

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
Re: RSS fetcher and index individual items - how can I realize this function?
Thanks for your reply. Maybe I didn't explain it clearly. I want to index each item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:

title : nutch-open source
description : nutch nutch nutch nutch nutch
url : http://lucene.apache.org/nutch
category : news
author : kauu

So, can the parse-rss plugin satisfy what I need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>

On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote:

Hi there,

I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something?

Cheers,
Chris

On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote:

Hi folks:

What I want to do is to separate an rss file into several pages, just as has been discussed before. I want to fetch an rss page and index it as different documents in the index, so the searcher can search an item's info as an individual hit.

My opinion: create a protocol to fetch the rss page and store it as several copies, each containing just one item tag. But the unique key is the URL, so how can I store them with the item's link tag as the unique key for a document?

So my question is how to realize this function in nutch-0.8.x. I've checked the code of the protocol-http plugin, but I can't find the code where a page is stored as a document. I want to separate the rss page into several ones before storing it, not as one document but as several. So can anyone give me some hints? Any reply will be appreciated!
The item's structure:

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

--
www.babatu.com
log4j problem
Why, when I change nutch/conf/log4j.properties (I only changed the first line from log4j.rootLogger=INFO,DRFA to log4j.rootLogger=DEBUG,DRFA), like this:

# RootLogger - DailyRollingFileAppender
#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG,DRFA

# Logging Threshold
log4j.threshhold=ALL

# special logging requirements for some commandline tools
log4j.logger.org.apache.nutch.crawl.Crawl=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Injector=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.Generator=INFO,cmdstdout
log4j.logger.org.apache.nutch.fetcher.Fetcher=INFO,cmdstdout
log4j.logger.org.apache.nutch.parse.ParseSegment=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentReader=INFO,cmdstdout
log4j.logger.org.apache.nutch.segment.SegmentMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.CrawlDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDb=INFO,cmdstdout
log4j.logger.org.apache.nutch.crawl.LinkDbMerger=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.Indexer=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=INFO,cmdstdout
log4j.logger.org.apache.nutch.indexer.IndexMerger=INFO,cmdstdout

does the console show me the error below?

log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: \ (The system cannot find the path specified.)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(Unknown Source)
        at java.io.FileOutputStream.<init>(Unknown Source)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:132)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:96)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:654)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:612)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:509)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:415)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:441)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:468)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:122)
        at org.apache.log4j.Logger.getLogger(Logger.java:104)
        at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
        at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Re: parse-rss - make the items separate pages
I think it's a great idea. We can't have more than one document in the index because the unique key is the URL. The only problem is how to write a separate protocol for the RSS.

On 1/28/07, Alan Tanaman [EMAIL PROTECTED] wrote:

This is a problem that we have encountered too (although in a different context than RSS). The problem is that the unique key is the URL - you cannot have more than one document in the index with the same URL.

The way around this might be to have a separate protocol (instead of the usual http one) that would be used only for RSS feeds, and which would append a sequential number to the real URL (passing, say, 10 identical copies of each page to parse-rss). parse-rss would then need to extract only the nth news item from the whole page. Any comments?

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]]
Sent: 27 January 2007 06:43
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: parse-rss - make the items separate pages

Who can tell me where and how a Nutch document is built in nutch-0.8.1? For example, one HTML page is one document, but I want to split one document into several.

On 1/27/07, kauu [EMAIL PROTECTED] wrote:

That's the right thing. I think we should do something when Nutch fetches a page successfully: check whether it is rss, and if so create as many pages as there are items. I don't know whether that would work. On the other hand, we could do something in the segment, just like what you say. I don't know whether we can write a plugin to get that functionality. Can anyone give me some hints?

On 1/26/07, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I need the same functionality, but it doesn't exist, and I believe it's not a simple task. The functionality required is basically to create a page in a segment for each item, and to add the URL to the crawldb.
Since the data already exists in the item element, there is no reason to fetch the page (item). After that, the only thing left is to index it. Any thoughts on how to achieve that goal?

Gal.

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss - make the items separate pages

I want to crawl the rss feeds and parse them, then index them, and finally, when I search the content, I want each hit to be just like an individual page. I don't know whether I'm explaining this clearly.

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an rss file. I want Nutch to treat an item as an individual page, so that when I search for something in this item, Nutch returns it as a hit. Can anyone tell me how to go about this? Any reply will be appreciated.

--
www.babatu.com
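Alan's workaround above, appending a sequential number to the real URL so each item gets its own unique key, can be sketched in a few lines of plain Java. The "#item=" suffix and the method names here are illustrative assumptions, not the actual protocol plugin:

```java
import java.util.ArrayList;
import java.util.List;

public class FeedItemKeys {

    // Derive a unique, stable per-item key by appending a sequential
    // suffix to the feed URL, as in the proposed RSS-only protocol.
    // NOTE: the "#item=" fragment syntax is a hypothetical choice.
    static List<String> itemKeys(String feedUrl, int itemCount) {
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i < itemCount; i++) {
            keys.add(feedUrl + "#item=" + i);
        }
        return keys;
    }

    public static void main(String[] args) {
        // A feed (hypothetical URL) with 3 items yields 3 distinct index keys.
        for (String key : itemKeys("http://example.com/feed.xml", 3)) {
            System.out.println(key);
        }
    }
}
```

A fetcher using such keys would hand the same feed content to parse-rss once per key, and parse-rss would extract only the i-th item for key i. One caveat: Nutch's URL normalizers may strip fragments, so a real implementation would need a suffix scheme that survives normalization.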
Re: parse-rss - make the items separate pages
That's the right thing. I think we should do something when Nutch fetches a page successfully: check whether it is rss, and if so create as many pages as there are items. I don't know whether that would work. On the other hand, we could do something in the segment, just like what you say. I don't know whether we can write a plugin to get that functionality. Can anyone give me some hints?

On 1/26/07, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I need the same functionality, but it doesn't exist, and I believe it's not a simple task. The functionality required is basically to create a page in a segment for each item, and to add the URL to the crawldb. Since the data already exists in the item element, there is no reason to fetch the page (item). After that, the only thing left is to index it. Any thoughts on how to achieve that goal?

Gal.

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss - make the items separate pages

I want to crawl the rss feeds and parse them, then index them, and finally, when I search the content, I want each hit to be just like an individual page. I don't know whether I'm explaining this clearly.

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an rss file. I want Nutch to treat an item as an individual page, so that when I search for something in this item, Nutch returns it as a hit. Can anyone tell me how to go about this? Any reply will be appreciated.

--
www.babatu.com
Re: parse-rss - make the items separate pages
Who can tell me where and how a Nutch document is built in nutch-0.8.1? For example, one HTML page is one document, but I want to split one document into several.

On 1/27/07, kauu [EMAIL PROTECTED] wrote:

That's the right thing. I think we should do something when Nutch fetches a page successfully: check whether it is rss, and if so create as many pages as there are items. I don't know whether that would work. On the other hand, we could do something in the segment, just like what you say. I don't know whether we can write a plugin to get that functionality. Can anyone give me some hints?

On 1/26/07, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I need the same functionality, but it doesn't exist, and I believe it's not a simple task. The functionality required is basically to create a page in a segment for each item, and to add the URL to the crawldb. Since the data already exists in the item element, there is no reason to fetch the page (item). After that, the only thing left is to index it. Any thoughts on how to achieve that goal?

Gal.

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss - make the items separate pages

I want to crawl the rss feeds and parse them, then index them, and finally, when I search the content, I want each hit to be just like an individual page. I don't know whether I'm explaining this clearly.

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an rss file. I want Nutch to treat an item as an individual page, so that when I search for something in this item, Nutch returns it as a hit. Can anyone tell me how to go about this? Any reply will be appreciated.

--
www.babatu.com
Re: parse-rss - make the items separate pages
That's right, but in other words, I just need to index the exact information in a page. In reality, real-world pages contain lots of spam, so I just want to index the description.

On 1/27/07, sishen [EMAIL PROTECTED] wrote:

On 1/26/07, Gal Nitzan [EMAIL PROTECTED] wrote:

Hi Kauu,

The functionality you require doesn't exist in the current parse-rss plugin. I need the same functionality, but it doesn't exist, and I believe it's not a simple task. The functionality required is basically to create a page in a segment for each item, and to add the URL to the crawldb. Since the data already exists in the item element, there is no reason to fetch the page (item). After that, the only thing left is to index it.

I don't think so. The data in the description is not complete, so fetching the page through the link is needed.

Any thoughts on how to achieve that goal?

Gal.

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 26, 2007 4:17 AM
To: nutch-dev@lucene.apache.org
Subject: parse-rss - make the items separate pages

I want to crawl the rss feeds and parse them, then index them, and finally, when I search the content, I want each hit to be just like an individual page. I don't know whether I'm explaining this clearly.

<item>
  <title>Late-arriving snowstorm hits Europe, delaying flights and disrupting traffic (photos)</title>
  <description>A snowstorm swept across Europe, causing many flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages. On January 24, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late snowstorm swept across ... for two consecutive days.</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu focus photo news</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an rss file. I want Nutch to treat an item as an individual page, so that when I search for something in this item, Nutch returns it as a hit. Can anyone tell me how to go about this? Any reply will be appreciated.

--
www.babatu.com
parse-rss test problem
I can't test my parse-rss plugin in nutch-0.8.1; I can't even parse the default rsstest.rss file.

2007-01-25 17:04:34,703 INFO conf.Configuration (Configuration.java:getConfResourceAsInputStream(340)) - found resource parse-plugins.xml at file:/E:/work/digibot_news/build_tt/parse-plugins.xml
2007-01-25 17:04:35,328 WARN parse.rss - org.apache.commons.feedparser.FeedParserException: java.lang.NoClassDefFoundError: org/jdom/Parent
    at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:191)
    at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:75)
    at org.apache.nutch.parse.rss.RSSParser.getParse(RSSParser.java:92)
    at org.apache.nutch.parse.ParseUtil.parseByExtensionId(ParseUtil.java:132)
    at org.apache.nutch.parse.rss.TestRSSParser.testIt(TestRSSParser.java:91)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: java.lang.NoClassDefFoundError: org/jdom/Parent
    at org.jaxen.jdom.JDOMXPath.<init>(JDOMXPath.java:100)
    at org.apache.commons.feedparser.RSSFeedParser.parse(RSSFeedParser.java:65)
    at org.apache.commons.feedparser.FeedParserImpl.parse(FeedParserImpl.java:185)
    ... 22 more
2007-01-25 17:04:35,421 WARN parse.rss (RSSParser.java:getParse(100)) - nutch:parse-rss:RSSParser Exception: java.lang.NoClassDefFoundError: org/jdom/Parent
2007-01-25 17:04:35,437 WARN parse.ParseUtil (ParseUtil.java:parseByExtensionId(138)) - Unable to successfully parse content file:/E:/work/digibot_news/rsstest.rss of type
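A note on the error above: NoClassDefFoundError for org.jdom.Parent usually means a JDOM jar containing the Parent interface is missing from the plugin's runtime classpath. One hedged fix, assuming the stock parse-rss plugin layout, is to drop the jar into the plugin's lib directory and declare it in src/plugin/parse-rss/plugin.xml; the jar filename below is an assumption, so match whatever version you actually have:

```xml
<runtime>
   <library name="parse-rss.jar">
      <export name="*"/>
   </library>
   <!-- keep the existing feedparser/jaxen library entries; add the JDOM jar -->
   <!-- "jdom-1.0.jar" is an assumed filename - use your actual jar name -->
   <library name="jdom-1.0.jar"/>
</runtime>
```

After adding the jar, rebuild the plugin with ant so the deploy directory picks it up.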
Re: Fetcher2
Please give us the URL, thanks.

On 1/25/07, chee wu [EMAIL PROTECTED] wrote: Just appended the portion for 0.8.1 to NUTCH-339.

----- Original Message -----
From: Armel T. Nene [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Thursday, January 25, 2007 8:06 AM
Subject: RE: Fetcher2

Chee, can you make the code available through Jira? Thanks, Armel

- Armel T. Nene, iDNA Solutions, Tel: +44 (207) 257 6124, Mobile: +44 (788) 695 0483, http://blog.idna-solutions.com

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED]
Sent: 24 January 2007 03:59
To: nutch-dev@lucene.apache.org
Subject: Re: Fetcher2

Thanks! I successfully ported Fetcher2 to Nutch 0.8.1; it was pretty easy. I can share the code if anyone wants to use it.

----- Original Message -----
From: Andrzej Bialecki [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 23, 2007 12:09 AM
Subject: Re: Fetcher2

chee wu wrote: Fetcher2 should be a great help for me, but it seems it can't be integrated with Nutch 0.8.1. Any advice on how to use it based on 0.8.1?

You would have to port it to Nutch 0.8.1 - e.g. change all Text occurrences to UTF8, and most likely make other changes too...

-- Best regards, Andrzej Bialecki :: Information Retrieval, Semantic Web :: Embedded Unix, System Integration :: http://www.sigram.com :: Contact: info at sigram dot com

-- www.babatu.com
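For anyone attempting the same port, Andrzej's suggestion (trunk's Fetcher2 uses Hadoop's Text, while the 0.8.1-era APIs use UTF8) boils down to mechanical substitutions of the following shape. This is a hedged sketch, not the actual NUTCH-339 patch, and the variable name fetchUrl is invented for illustration:

```diff
-import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.UTF8;

-Text url = new Text(fetchUrl);
+UTF8 url = new UTF8(fetchUrl);
```

Method signatures that take or return Text (e.g. in output collectors) need the same substitution; the compiler errors after the first pass will point out the remaining spots.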
parse-rss make them items as different pages
I want to crawl RSS feeds, parse them, then index them, so that in the end, when searching the content, each hit is returned just like an individual page. I don't know whether I have explained this clearly.

<item>
  <title>Europe's snowstorms strike late, causing flight delays and traffic chaos (photos)</title>
  <description>Snowstorms swept across Europe, causing repeated flight delays. On January 24, several airliners waited at Stuttgart airport in Germany to have ice and snow removed from their fuselages; the same day, workers cleared snow from a runway at Munich airport in southern Germany. According to reports, the late-arriving snowstorm swept on for two consecutive days...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>Sohu Focus Photo News</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>

This is one item in an RSS file. I want Nutch to treat an item like an individual page, so that when I search for something in that item, Nutch returns it as a hit. Can anyone tell me how to do this? Any reply will be appreciated.

-- www.babatu.com
Re: hi all:
Thanks very much, I'll try it.

On 12/9/06, Sami Siren [EMAIL PROTECTED] wrote:

吴志敏 wrote: I want to read the stored segments into an XML file, but when I read SegmentReader.java, I found that it's not a simple thing - it's a Hadoop job that dumps a text file. I just want to dump the parts of the segments' content that I'm interested in to XML. Can someone tell me how to do this? Any reply will be appreciated!

Segment data is basically just a bunch of files containing key-value pairs, so there's always the possibility of reading the data directly with the help of:
http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.Reader.html

To see what kind of object to expect, you can just examine the beginning of the file, where some metadata is stored - like the class used for the key and the class used for the value (that metadata is also available from methods of the SequenceFile.Reader class). For example, to read the contents of Content data from a segment one could use something like:

  SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
  Text url = new Text();           // key
  Content content = new Content(); // value
  while (reader.next(url, content)) {
    // now just use url and content the way you like
  }

-- Sami Siren

-- www.babatu.com
Re: Question on adaptive re-fetch plugin
Yes, I'm on your side.

On 11/23/06, Scott Green [EMAIL PROTECTED] wrote:

Hi,

NUTCH-61 (http://issues.apache.org/jira/browse/NUTCH-61) is about an adaptive re-fetch plugin, and Jerome Charron had commented: "Why not make FetchSchedule a new ExtensionPoint, and then DefaultFetchSchedule and AdaptiveFetchSchedule fetch schedule plugins?" I am for it. Maintaining non-official Nutch source is bitter for me. So why not provide another plugin and test it? When it is stable enough, we can merge them, right?

- Scott

-- www.babatu.com
How to start working with MapReduce?
Does anyone know the details of the process behind the topic "how to start working with MapReduce"? I've read something in the FAQ, but I don't understand it very well. My version is 0.7.2, not 0.8.x.

-- www.babatu.com
why can't build in the Linux with ant
Hi: I have a problem now - I can't build Nutch on Linux with ant. My ant version is Apache Ant 1.5.2-20, compiled on September 25 2003. The error is below. Has anyone run into the same problem? I need your help.

Buildfile: build.xml

BUILD FAILED
file:/nutch/nutch-0.7.2/build.xml:20: Unexpected element dirname

Total time: 1 second

-- www.babatu.com
Re: implement thai lanaguage analyzer in nutch
I think you should learn JavaCC and then understand NutchAnalysis.jj; then the Thai case will be resolved soon. Just try it.

On 11/7/06, sanjeev [EMAIL PROTECTED] wrote:

Hello,

After playing around with Nutch for a few months, I was trying to implement the Thai language analyzer for Nutch. I downloaded the Subversion version and compiled using ant - everything fine. Next - I didn't see any tutorial for Thai, but I did see one for Chinese at http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153. I tried following the same steps outlined there, but ran into compiler errors... a type mismatch between the Lucene Token and the Nutch Token. Suffice to say I am back at square one as far as implementing the Thai language analyzer for Nutch goes. Can someone please outline the exact procedure for me, or point me to a tutorial that explains how? I would be highly obliged. Thanks.

-- View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7214087 Sent from the Nutch - Dev mailing list archive at Nabble.com.

-- www.babatu.com
Re: hi,how to use the ICTCLASCall
Thanks anyway.

On 3/27/06, Yong-gang Cao [EMAIL PROTECTED] wrote:

Please visit http://chiefadminofficer.googlepages.com/mycodes for the source code of ICTCLASCaller and the DLL used by it. You also need to get the data files from the ICTCLAS source site (http://www.nlp.org.cn/project/project.php?proj_id=6) to run ICTCLASCaller. Notice: usage of the code and the DLL is restricted by the ICTCLAS copyright (NOT MINE). Details of usage are in the comments of ICTCLASCaller.java. Good luck!

2006/3/27, kauu [EMAIL PROTECTED]:

hi all: I ran into a problem when integrating Nutch-0.7.1 with an intelligent Chinese lexical analysis system, following this page: http://www.nutchhacks.com/ftopic391.php?highlight=chinese which was written by caoyuzhong. When I ran ant over my modified java files, javac told me it couldn't find the symbol caomo.ICTCLASCaller in this line:

private final static caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();

So my question is how to deal with it? Any reply will be appreciated!

-- www.babatu.com

-- http://spaces.msn.com/members/caomo Beijing University of Aeronautics and Astronautics (BeiHang University) P.B.: 2-53# MailBox, 37 Xueyuan Road, Beijing, 100083 P.R.China

-- www.babatu.com
Re: hi,how to use the ICTCLASCall
+ (preToken.startOffset());
+ curToken.setEndOffset(preToken.startOffset()
+     + curTokenLength);
+ tokenStart++;
+ tokenArray = null;
+ i = preToken.startOffset();
+ startSearch = i; // the start position in textArray for the next turn, if needed
+ continue;
+ }
+ }
+
+ }
+ }
+ //
+
+ j = 0;
+ if (textArray[i] == tokenArray[j]) {
+
+   if (i + tokenArray.length - 1 >= textArray.length) {
+     // do nothing?
+   } else {
+
+     int k = i + 1;
+     for (j = 1; j < tokenArray.length; j++) {
+       if (textArray[k++] != tokenArray[j])
+         break; // no match
+     }
+     if (j >= tokenArray.length) { // match
+       curToken.setStartOffset(i);
+       curToken.setEndOffset(i + tokenArray.length);
+
+       i = i + tokenArray.length - 1;
+       tokenStart++;
+       startSearch = i; // the start position in textArray for the next turn, if needed
+       tokenArray = null;
+     }
+   }
+ }
+ }
+ if (tokenStart == tokens.length)
+   break; // have reset all tokens
+
+ if (tokenStart < tokens.length) { // next turn
+   curToken.setStartOffset(preToken.startOffset());
+   curToken.setEndOffset(preToken.endOffset());
+
+   tokenStart++; // skip this token
+
+ }
+
+ } // the end of while(true)
+ }

Under the line Token[] tokens = getTokens(text) in getSummary(String text, Query query), add:

+ resetTokenOffset(tokens, text);

I perform Chinese word segmentation after the tokenizer and insert a space between every two Chinese words, so I need to reset all tokens' startOffset and endOffset in Summarizer.java. To do this, I added a method resetTokenOffset(Token[] tokens, String text) in Summarizer.java, and I had to add two methods, setStartOffset(int start) and setEndOffset(int end), to Lucene's Token.java. With the above four steps, Nutch can search Chinese web sites nearly perfectly. You can try it. I just made Nutch do it, but my solution is less than perfect. If Chinese word segmentation could be done in NutchAnalysis.jj before the tokenizer, then we wouldn't need to reset the tokens' offsets in Summarizer.java, and everything would be perfect.
But it seems too difficult to perform intelligent Chinese word segmentation in NutchAnalysis.jj. Even impossible? Any suggestions?

Buildfile: build.xml

init:

compile-core:
    [javac] Compiling 247 source files to E:\search\new\nutch-0.7.1\build\classes
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Query.java:408: unreported exception org.apache.nutch.analysis.ParseException; must be caught or declared to be thrown
    [javac]         return fixup(NutchAnalysis.parseQuery(queryString));
    [javac]         ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:140: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]         curToken.setStartOffset(preToken.startOffset());
    [javac]         ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:141: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]         curToken.setEndOffset(preToken.startOffset() + curTokenLength);
    [javac]         ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:164: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]         curToken.setStartOffset(i);
    [javac]         ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:165: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]         curToken.setEndOffset(i + tokenArray.length);
    [javac]         ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:179: cannot find symbol
    [javac] symbol  : method setStartOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]         curToken.setStartOffset(preToken.startOffset());
    [javac]         ^
    [javac] E:\search\new\nutch-0.7.1\src\java\org\apache\nutch\searcher\Summarizer.java:180: cannot find symbol
    [javac] symbol  : method setEndOffset(int)
    [javac] location: class org.apache.lucene.analysis.Token
    [javac]         curToken.setEndOffset(preToken.endOffset());
    [javac]         ^
    [javac] Note: * uses or overrides a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
    [javac] Note: Some input files use unchecked or unsafe operations.
    [javac] Note: Recompile with -Xlint:unchecked for details.
    [javac] 7 errors

BUILD FAILED
E:\search\new\nutch-0.7.1\build.xml:70: Compile failed; see the compiler error output for details.

Total time: 39 seconds

On 3/27/06, kauu [EMAIL PROTECTED] wrote: i get
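The offset-resetting idea in the patch above can be illustrated independently of Lucene: after inserting spaces between segmented words, each token's offsets in the spaced text must be mapped back to the original unspaced text. This is a simplified standalone sketch of that mapping; the class and method names are mine, not Nutch's:

```java
import java.util.Arrays;

/** Sketch: map segmented words back to their offsets in the original text. */
public class OffsetRemap {

    /**
     * Given words produced by inserting spaces into the original text
     * (without reordering), return each word's start offset in the original
     * text by searching forward from the end of the previous match.
     */
    public static int[] originalStarts(String original, String[] words) {
        int[] starts = new int[words.length];
        int from = 0; // start position in the original for the next search
        for (int i = 0; i < words.length; i++) {
            int pos = original.indexOf(words[i], from);
            starts[i] = pos;
            from = pos + words[i].length();
        }
        return starts;
    }

    public static void main(String[] args) {
        // "abcd" segmented as "ab cd": offsets in the original are 0 and 2
        int[] s = originalStarts("abcd", new String[] {"ab", "cd"});
        System.out.println(Arrays.toString(s)); // prints [0, 2]
    }
}
```

The actual patch does this token by token against Lucene's Token offsets, which is why it needed the extra setStartOffset/setEndOffset mutators; the search-forward-from-the-last-match logic is the same.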
hi,how to use the ICTCLASCall
hi all: I ran into a problem when integrating Nutch-0.7.1 with an intelligent Chinese lexical analysis system, following this page: http://www.nutchhacks.com/ftopic391.php?highlight=chinese which was written by caoyuzhong. When I ran ant over my modified java files, javac told me it couldn't find the symbol caomo.ICTCLASCaller in this line:

private final static caomo.ICTCLASCaller spliter = new caomo.ICTCLASCaller();

So my question is how to deal with it? Any reply will be appreciated!

-- www.babatu.com