How to deal with javascript urls?
For example: <a href="javascript:customCss(6017162)" id="customCssMenu">test</a>. Can nutch in fact get content from this kind of url?
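Nutch's protocol plugins have no handler for the javascript: scheme, so such links can't be fetched as-is; the usual remedy (a minimal sketch, using the standard regex url-filter syntax shown elsewhere in this archive) is to exclude them up front in crawl-urlfilter.txt:

  # skip javascript: pseudo-urls; they name script calls, not fetchable content
  -^javascript: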
Re: java.net.SocketTimeoutException: Read timed out
Oh. Thank you very much. On 06-4-14, Raghavendra Prabhu [EMAIL PROTECTED] wrote: Hi Elwin Just switch it to protocol-http in the conf file (the nutch-default.xml file). If you don't want to use the threaded thing, change the number of threads in the configuration file. Have a limited number of threads fetching (like Doug said). Rgds Prabhu On 4/14/06, Elwin [EMAIL PROTECTED] wrote: Hi Raghavendra Then how do I use protocol-http instead of protocol-httpclient? Can I still use HttpResponse? On 06-4-13, Raghavendra Prabhu [EMAIL PROTECTED] wrote: Hi Doug I am not sure whether this problem is entirely down to bandwidth starvation. In some cases, using protocol-http instead of protocol-httpclient seems to fix the problem. I am not sure, but the above seemed to fix it. Rgds Prabhu On 4/13/06, Elwin [EMAIL PROTECTED] wrote: In fact I'm not using the fetcher of nutch; I just call HttpResponse in my own code, which is not multi-threaded. 2006/4/13, Doug Cutting [EMAIL PROTECTED]: Elwin wrote: When I use httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in the conf file? Perhaps, if you're working with slow sites. But, more likely, you're using too many fetcher threads and exceeding your available bandwidth, causing threads to starve and time out. Doug
Re: java.net.SocketTimeoutException: Read timed out
In fact I'm not using the fetcher of nutch; I just call HttpResponse in my own code, which is not multi-threaded. 2006/4/13, Doug Cutting [EMAIL PROTECTED]: Elwin wrote: When I use httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in the conf file? Perhaps, if you're working with slow sites. But, more likely, you're using too many fetcher threads and exceeding your available bandwidth, causing threads to starve and time out. Doug
Re: java.net.SocketTimeoutException: Read timed out
Hi Raghavendra Then how do I use protocol-http instead of protocol-httpclient? Can I still use HttpResponse? On 06-4-13, Raghavendra Prabhu [EMAIL PROTECTED] wrote: Hi Doug I am not sure whether this problem is entirely down to bandwidth starvation. In some cases, using protocol-http instead of protocol-httpclient seems to fix the problem. I am not sure, but the above seemed to fix it. Rgds Prabhu On 4/13/06, Elwin [EMAIL PROTECTED] wrote: In fact I'm not using the fetcher of nutch; I just call HttpResponse in my own code, which is not multi-threaded. 2006/4/13, Doug Cutting [EMAIL PROTECTED]: Elwin wrote: When I use httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in the conf file? Perhaps, if you're working with slow sites. But, more likely, you're using too many fetcher threads and exceeding your available bandwidth, causing threads to starve and time out. Doug
java.net.SocketTimeoutException: Read timed out
When I use the httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in the conf file?
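If slow sites really are the cause, the timeout can be raised by overriding the property in nutch-site.xml (a sketch; http.timeout is the real property name, but the value below is only an illustrative figure in milliseconds):

  <property>
    <name>http.timeout</name>
    <value>30000</value>
    <description>The default network timeout, in milliseconds.</description>
  </property>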
Inject url into a temp webdb
WebDBInjector injector = new WebDBInjector(dbWriter); I dynamically use the injector to inject urls into a temporary, empty webdb. Then I use Enumeration e = webdb.pages() to dump the urls from that webdb, but I seem to get nothing. Do I need to update the webdb after injecting the urls? If so, how?
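Yes, in a sense: the 0.7 WebDB is batch-oriented, so edits queued through the writer are only merged into the db when the writer is closed, and a reader opened before that sees nothing. A minimal sketch of the close-then-read pattern, with class and method names recalled from the 0.7 tree (verify against your source; the add call and db path are illustrative assumptions):

  import java.io.File;
  import java.util.Enumeration;
  import org.apache.nutch.db.*;
  import org.apache.nutch.fs.NutchFileSystem;
  import org.apache.nutch.tools.WebDBInjector;

  public class InjectThenRead {
      public static void main(String[] args) throws Exception {
          NutchFileSystem fs = NutchFileSystem.get();
          File dbDir = new File("/tmp/testdb"); // illustrative path

          // write phase: queue pages through the injector, then close the
          // writer -- the queued edits are merged into the db only here
          IWebDBWriter writer = new WebDBWriter(fs, dbDir);
          WebDBInjector injector = new WebDBInjector(writer);
          // injector.addPage(url) -- hypothetical: use whatever add method
          // WebDBInjector actually exposes in your version
          writer.close();

          // read phase: open the reader only after the writer is closed
          IWebDBReader reader = new WebDBReader(fs, dbDir);
          for (Enumeration e = reader.pages(); e.hasMoreElements();) {
              Page p = (Page) e.nextElement();
              System.out.println(p.getURL());
          }
          reader.close();
      }
  }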
find duplicate urls in webdb
When I read pages out of a webdb and printed the url of each page, I found two urls that are exactly the same. Is it possible for two pages in the webdb to have the same url?
Re: About regex in the crawl-urlfilter.txt config file
Oh, I have asked a silly question about regex, hehe. 2006/2/23, Jack Tang [EMAIL PROTECTED]: Hi I think the url-filter uses "contain" rather than "match". /Jack On 2/23/06, Elwin [EMAIL PROTECTED] wrote: # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ Will this pattern accept a url like http://MY.DOMAIN.NAME/([a-z0-9]*\.)*/? I think not, but in fact nutch can crawl and get urls like that in an intranet crawl. Why? -- Keep Discovering ... ... http://www.jroller.com/page/jmars
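Jack's point about "contain rather than match" is the find-versus-matches distinction; a self-contained demo with java.util.regex (the 0.7 filter actually goes through the Perl5/ORO library, but the prefix semantics are the same idea):

  import java.util.regex.Pattern;

  public class FilterSemantics {
      public static void main(String[] args) {
          Pattern p = Pattern.compile("^http://([a-z0-9]*\\.)*MY.DOMAIN.NAME/");
          String url = "http://MY.DOMAIN.NAME/some/deep/page.html";
          // matches() must consume the whole input, so it fails here
          System.out.println(p.matcher(url).matches()); // false
          // find() succeeds as soon as the anchored prefix occurs
          System.out.println(p.matcher(url).find());    // true
      }
  }

So the filter accepts every url under the host, which is why the intranet crawl follows pages whose paths look nothing like the pattern.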
Why Perl5 regular expressions?
Why does the url filter of nutch use Perl5 regular expressions? Any benefits?
Re: AW: extract links problem with parse-html plugin
Yes, it's true, although it's not the cause of my problem. On 06-2-20, Piotr Kosiorowski [EMAIL PROTECTED] wrote: Hello, One more thing to check:

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
    <description>The maximum number of outlinks that we'll process for a page.</description>
  </property>

Regards Piotr Guenter, Matthias wrote: Hi Elwin Did you check the content limit? Otherwise the truncation occurs naturally, I guess:

  <property>
    <name>http.content.limit</name>
    <value>65536</value>
    <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
  </property>

Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Friday, 17 February 2006 09:36 To: nutch-user@lucene.apache.org Subject: Re: extract links problem with parse-html plugin I have written a test class HtmlWrapper and here is some code:

  HtmlWrapper wrapper = new HtmlWrapper();
  Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
  String temp = new String(c.getContent());
  System.out.println(temp);
  wrapper.parseHttpContent(c);
  // get all outlinks into an ArrayList
  ArrayList links = wrapper.getBlogLinks();
  for (int i = 0; i < links.size(); i++) {
      String urlString = (String) links.get(i);
      System.out.println(urlString);
  }

I can only get a few of the links from that page. The url is from a Chinese site; however, you can just skip the non-English content and look at the html elements. 2006/2/17, Guenter, Matthias [EMAIL PROTECTED]: Hi Elwin Can you provide samples of not-working links and code? And put it into JIRA? Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Fri 17.02.2006 08:51 To: nutch-user@lucene.apache.org Subject: extract links problem with parse-html plugin It seems that the parse-html plugin may not process many pages well: I have found that the plugin can't extract all valid links in a page when I test it in my code. I guess it may be caused by the style of the html page? When I view the source of an html page I parsed, I see that some elements in the source are separated by unneeded spaces. This situation is quite common on the pages of large portal sites and news sites.
Re: AW: extract links problem with parse-html plugin
No, I haven't tried that. I just use the default parser for the plugin. It seems to work well now. Thx. 2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]: Elwin wrote: Yes, it's true, although it's not the cause of my problem. Did you try the alternative HTML parser (TagSoup) supported by the plugin? You need to set the property parser.html.impl to tagsoup. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
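For reference, the switch Andrzej describes is an ordinary property override (a sketch using the nutch-site.xml convention seen elsewhere in this archive; the description wording is approximate):

  <property>
    <name>parser.html.impl</name>
    <value>tagsoup</value>
    <description>HTML parser implementation: neko or tagsoup.</description>
  </property>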
Re: AW: extract links problem with parse-html plugin
I will try it. Many thanks. 2006/2/20, Andrzej Bialecki [EMAIL PROTECTED]: Elwin wrote: No, I haven't tried that. I just use the default parser for the plugin. It seems to work well now. Thx. I often find TagSoup performing better than NekoHTML. In case of some grave HTML errors, Neko tends to simply truncate the document, while TagSoup just keeps on truckin'. This is especially true for pages with multiple html elements, where Neko ignores all elements but the first one, while TagSoup just treats any html element inside a document like any other nested element. -- Best regards, Andrzej Bialecki, http://www.sigram.com Contact: info at sigram dot com
Re: No Accents
I think maybe you could add a mapping between these letters. 2006/2/20, Franz Werfel [EMAIL PROTECTED]: Hello, Sorry, this is probably in the documentation somewhere, but I couldn't find it. How do I index and search accented words without accents? For example: Portégé (a model of Toshiba laptop) would be indexed as portege, and a search for portégé would be equivalent to a search for portege and find Portégé, Portegé, portége, portege, etc. This is how Google works; maybe Nutch does the same by default? Currently, by default (0.7.1), Portégé is indexed as portégé and found only if searched for as portégé or Portégé (but not portege). This is all the more useful considering users in the US do not have easy access to accented letters on their keyboards... Thanks, Frank.
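One way to build that mapping (a sketch on a modern JVM; java.text.Normalizer is Java 6+, so the 2006-era equivalent would be an accent-folding analyzer wired into both indexing and search):

  import java.text.Normalizer;

  public class AccentFolder {
      // decompose accented characters, then strip the combining marks
      static String fold(String s) {
          String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
          return decomposed.replaceAll("\\p{M}", "");
      }

      public static void main(String[] args) {
          System.out.println(fold("Portégé")); // prints Portege
      }
  }

Applied at both index time and query time, this makes portege and portégé hit the same terms.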
Re: Content-based Crawl vs Link-based Crawl?
Hi Howie, Thank you for the valuable suggestion. I will consider it carefully. As I'm going to parse non-English (actually Chinese) pages, I think regular expressions may not be very useful to me. I've decided to integrate some simple data mining techniques to achieve it. 2006/2/19, Howie Wang [EMAIL PROTECTED]: I think doing this sort of thing works out very well for niche search engines. Analyzing the contents of the page takes some time, but it's just milliseconds per page. If you contrast this with actually fetching pages that you don't want (several seconds * num pages), you can see that the time savings are very much in your favor. I'm not sure you'd create a URLFilter, since I don't think that gives you easy access to the page contents. You could do it in an HtmlParseFilter: just copy the parse-html plugin, look for the bit of code where the Outlinks array is set, then filter that Outlinks array as you see fit. One thing to be careful about is using regular expressions in Java to analyze the page contents. I've had lots of problems with hanging using java.util.regex. I get this with perfectly legal regexes, and only on certain pages. It's not as big a problem for me, since most of my regex work happens during the indexing phase, and it's easy to re-index. If it happens during the fetch, it's a bigger pain, since you have to recover from an aborted fetch. So you might want to do lots of small crawls instead of big full crawls. Howie I think this can be done by using a plug-in like the url filter, but it seems to cause a performance problem for the crawling process. So I'd like to hear your opinions. Is it possible or meaningful to crawl not just by links but by contents or terms?
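The heart of the HtmlParseFilter approach Howie outlines is just pruning the Outlink[] array before it goes into the ParseData; a minimal sketch (the keyword list and the anchor-text rule are illustrative assumptions, not anything in the plugin):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.nutch.parse.Outlink;

  public class OutlinkKeywordFilter {
      // illustrative dictionary; in practice this would be loaded from a file
      private static final String[] KEYWORDS = { "blog", "news" };

      // keep only outlinks whose anchor text contains at least one keyword
      public static Outlink[] filter(Outlink[] outlinks) {
          List kept = new ArrayList();
          for (int i = 0; i < outlinks.length; i++) {
              String anchor = outlinks[i].getAnchor();
              if (anchor == null) continue;
              for (int j = 0; j < KEYWORDS.length; j++) {
                  if (anchor.indexOf(KEYWORDS[j]) >= 0) {
                      kept.add(outlinks[i]);
                      break;
                  }
              }
          }
          return (Outlink[]) kept.toArray(new Outlink[kept.size()]);
      }
  }

Note the plain indexOf checks: per Howie's warning, they sidestep java.util.regex entirely, which also avoids the hangs he mentions.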
Content-based Crawl vs Link-based Crawl?
Nutch crawls web pages from link to link by extracting outlinks from each page. We could, for example, check whether the link text contains keywords from a dictionary to decide whether or not to crawl the link. Moreover, we could check whether the content of a page fetched via an outlink contains keywords from a dictionary. I think this could be done with a plug-in like the url filter, but it seems likely to cause a performance problem for the crawling process. So I'd like to hear your opinions: is it possible or meaningful to crawl not just by links but by contents or terms?
Re: extract links problem with parse-html plugin
I have written a test class HtmlWrapper and here is some code:

  HtmlWrapper wrapper = new HtmlWrapper();
  Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
  String temp = new String(c.getContent());
  System.out.println(temp);
  wrapper.parseHttpContent(c);
  // get all outlinks into an ArrayList
  ArrayList links = wrapper.getBlogLinks();
  for (int i = 0; i < links.size(); i++) {
      String urlString = (String) links.get(i);
      System.out.println(urlString);
  }

I can only get a few of the links from that page. The url is from a Chinese site; however, you can just skip the non-English content and look at the html elements. 2006/2/17, Guenter, Matthias [EMAIL PROTECTED]: Hi Elwin Can you provide samples of not-working links and code? And put it into JIRA? Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Fri 17.02.2006 08:51 To: nutch-user@lucene.apache.org Subject: extract links problem with parse-html plugin It seems that the parse-html plugin may not process many pages well: I have found that the plugin can't extract all valid links in a page when I test it in my code. I guess it may be caused by the style of the html page? When I view the source of an html page I parsed, I see that some elements in the source are separated by unneeded spaces. This situation is quite common on the pages of large portal sites and news sites.
Re: extract links problem with parse-html plugin
Hi Guenter, I think you are right. Although I haven't re-run the code, I checked the last url I got from that page, and it sits right in the middle of the page, so it seems the page was truncated. Many thanks! On 06-2-17, Guenter, Matthias [EMAIL PROTECTED] wrote: Hi Elwin Did you check the content limit? Otherwise the truncation occurs naturally, I guess:

  <property>
    <name>http.content.limit</name>
    <value>65536</value>
    <description>The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all.</description>
  </property>

Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Friday, 17 February 2006 09:36 To: nutch-user@lucene.apache.org Subject: Re: extract links problem with parse-html plugin I have written a test class HtmlWrapper and here is some code:

  HtmlWrapper wrapper = new HtmlWrapper();
  Content c = getHttpContent("http://blog.sina.com.cn/lm/hot/index.html");
  String temp = new String(c.getContent());
  System.out.println(temp);
  wrapper.parseHttpContent(c);
  // get all outlinks into an ArrayList
  ArrayList links = wrapper.getBlogLinks();
  for (int i = 0; i < links.size(); i++) {
      String urlString = (String) links.get(i);
      System.out.println(urlString);
  }

I can only get a few of the links from that page. The url is from a Chinese site; however, you can just skip the non-English content and look at the html elements. 2006/2/17, Guenter, Matthias [EMAIL PROTECTED]: Hi Elwin Can you provide samples of not-working links and code? And put it into JIRA? Kind regards Matthias -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Fri 17.02.2006 08:51 To: nutch-user@lucene.apache.org Subject: extract links problem with parse-html plugin It seems that the parse-html plugin may not process many pages well: I have found that the plugin can't extract all valid links in a page when I test it in my code. I guess it may be caused by the style of the html page? When I view the source of an html page I parsed, I see that some elements in the source are separated by unneeded spaces. This situation is quite common on the pages of large portal sites and news sites.
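Given that diagnosis, the follow-up fix is to raise or disable the limit in nutch-site.xml (a sketch; per the property's own description quoted above, a negative value disables truncation entirely):

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>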
Question about fExtensionPoints in PluginRepository.java
fExtensionPoints is a HashMap. What about two plugins that extend the same extension point — what happens with the code fExtensionPoints.put(xpId, point)?
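As far as I can tell nothing is lost: the map key is the extension-point id and the value is the single ExtensionPoint object, which itself aggregates every Extension contributed to it, so two plugins extending the same point both land inside one map entry. A second put with the same xpId would only occur if two plugins declared (rather than extended) the same point. A sketch of the lookup side (method names recalled from the PluginRepository/ExtensionPoint classes; verify against your tree):

  // one ExtensionPoint per id; many Extensions per ExtensionPoint
  ExtensionPoint point = (ExtensionPoint) fExtensionPoints.get(xpId);
  Extension[] contributed = point.getExtensions();
  System.out.println(contributed.length + " extensions registered at " + xpId);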
Re: Duplicate urls in urls file
Did you achieve it by extending nutch with a plugin? I think it's possible to do it in a URLFilter plugin that filters rss feed links. 2006/2/16, Hasan Diwan [EMAIL PROTECTED]: Elwin: On 13/02/06, Elwin [EMAIL PROTECTED] wrote: Do you use a fixed set of rss feeds for the crawl or discover rss feeds dynamically? Before I broke the script, it would take the URL, grab the feeds specified in the link tags, then parse them. I suspect this is similar to what the parse-rss plugin does, but I have not had the chance to look at it yet. -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Re: Duplicate urls in urls file
Hi Hasan, Do you use a fixed set of rss feeds for the crawl or discover rss feeds dynamically? 2006/2/14, Hasan Diwan [EMAIL PROTECTED]: I've written a perl script to build up a urls file to crawl from RSS feeds. Will nutch handle duplicate URLs in the crawl file or would that logic need to be in my perl script? -- Cheers, Hasan Diwan [EMAIL PROTECTED]
Problem in debugging codes that using nutch api
I have written some test code using the nutch api. As nutch-default.xml and nutch-site.xml are included in nutch-0.7.jar, can I debug my code with these files in a conf dir instead of bundled in the jar file? Besides, how can I refer to other files like mime-types.xml in my code? Where does NutchConf load them from?
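As far as I can tell, NutchConf resolves these files as classpath resources, so a local conf directory placed ahead of the jar on the classpath shadows the copies packaged inside it, and mime-types.xml is found the same way. A sketch, assuming a Unix shell and illustrative paths (plus whatever library jars nutch needs):

  java -classpath ./conf:./nutch-0.7.jar MyTest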
Why are other config files not included in nutch-0.7.jar
That is, why are config files other than nutch-default.xml and nutch-site.xml not packaged in the jar?
How to control contents to be indexed?
In the process of crawling and indexing, some pages are used only as temporary links to the pages I actually want to index. How can I prevent those kinds of pages from being indexed? Or which part of nutch should I extend?
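One extension point that fits is an IndexingFilter, which sits between parsing and the index writer; in later Nutch versions returning null from filter() discards the document, which is the behavior sketched here (hedged: the method signature is recalled from the 0.7-era API, and the hub-detection rule is purely an illustrative assumption):

  import org.apache.lucene.document.Document;
  import org.apache.nutch.fetcher.FetcherOutput;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.parse.Parse;

  public class SkipHubPagesFilter implements IndexingFilter {
      public Document filter(Document doc, Parse parse, FetcherOutput fo) {
          // illustrative rule: a page with many outlinks but little text
          // is treated as a link hub and dropped from the index
          int outlinks = parse.getData().getOutlinks().length;
          int textLen = parse.getText().length();
          if (outlinks > 20 && textLen < 500) {
              return null; // discard: the page is still crawled, just not indexed
          }
          return doc;
      }
  }

The hub pages are still fetched and their outlinks still followed; they simply never reach the index.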
Re: How to control contents to be indexed?
Thank you. But what I want to crawl comes from the internet at large, and certainly I can't control those pages. 2006/2/10, Vanderdray, Jacob [EMAIL PROTECTED]: If you control the temporary link pages, then just add a robots meta tag. Take a look at http://www.robotstxt.org/wc/meta-user.html to see what your options are. Jake. -Original Message- From: Elwin [mailto:[EMAIL PROTECTED] Sent: Friday, February 10, 2006 4:38 AM To: nutch-user@lucene.apache.org Subject: How to control contents to be indexed? In the process of crawling and indexing, some pages are used only as temporary links to the pages I actually want to index. How can I prevent those kinds of pages from being indexed? Or which part of nutch should I extend?
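For anyone who does control the link pages, the tag Jake means looks like this (standard robots meta syntax): <meta name="robots" content="noindex,follow"> — noindex keeps the page out of the index, while follow still lets its links be traversed.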
Re: Which version of rss does parse-rss plugin support?
According to the code:

  theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));

I can see that the item description is also included. However, when I tried with this feed: http://kgrimm.bravejournal.com/feed.rss I could only get the title and description for the channel, and failed to search for words in the item descriptions. From the above code, the item description is combined with the outlink url; is it used as contentTitle for that url? When the outlink is fetched and parsed, I think new data about that url will be generated. On 06-2-11, Chris Mattmann [EMAIL PROTECTED] wrote: Hi, the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed. So the titles of the RSS Channels are what is delivered for indexing, right? They're certainly part of it, but not the only part. The concatenation of the titles of the RSS Channels is what is delivered for the title portion of indexing. If I want the indexer to include more information about an rss file (such as item descriptions), can I just concatenate them to the contentTitle? They're already there. There is a variable called indexText: ultimately that variable includes the item descriptions, along with the channel descriptions. That, along with the title portion of indexing, is the full set of textual data delivered by the parser for indexing. So it already includes that information. Check out lines 137 and 161 in the parser to see what I mean. Also, check out lines 204-207, which are:

  ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
      contentTitle.toString(), outlinks, content.getMetadata());
  parseData.setConf(this.conf);
  return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the Parser, i.e. the ParseImpl, includes both the indexText and the parse data (which contains the title text). Now, if you wanted to add any other metadata gleaned from the RSS to the title text or the content text, you can always modify the code to do that in your own environment. The RSS Parser plugin returns a full channel model and item model that can be extended and used for those purposes. Hope that helps! Cheers, Chris On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, That should work; however, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over. Check out this previous post of mine on the list to get a better idea of what the real issue is: http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html G'luck! Cheers, Chris -Original Message- From: 盖世豪侠 [mailto:[EMAIL PROTECTED] Sent: Saturday, February 04, 2006 11:40 PM To: nutch-user@lucene.apache.org Subject: Re: Which version of rss does parse-rss plugin support? Hi Chris How do I change the plugin.xml? For example, if I want to crawl rss files ending with xml, do I just add a new element?
  <implementation id="org.apache.nutch.parse.rss.RSSParser"
                  class="org.apache.nutch.parse.rss.RSSParser"
                  contentType="application/rss+xml"
                  pathSuffix="rss"/>
  <implementation id="org.apache.nutch.parse.rss.RSSParser"
                  class="org.apache.nutch.parse.rss.RSSParser"
                  contentType="application/rss+xml"
                  pathSuffix="xml"/>

Am I right? On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there there is an attribute called pathSuffix. Change that to handle whatever type of rss file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support. Note that you might not have *a lot* of success with being able to control the content type for rss files returned by web servers. I've seen a LOT of inconsistency in the way they're configured by administrators, etc. However, just to