Re: Which version of rss does parse-rss plugin support?
Hi,

> the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed.

> So the titles of the RSS Channels are what is delivered for indexing, right?

They're certainly part of it, but not the only part. The concatenation of the titles of the RSS Channels is what is delivered for the title portion of indexing.

> If I want the indexer to include more information about an RSS file (such as item descriptions), can I just concatenate them to the contentTitle?

They're already there. There is a variable called indexText: ultimately that variable includes the item descriptions, along with the channel descriptions. That, along with the title portion of indexing, is the full set of textual data delivered by the parser for indexing. So it already includes that information. Check out lines 137 and 161 in the parser to see what I mean. Also, check out lines 204-207, which are:

    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
        contentTitle.toString(), outlinks, content.getMetadata());
    parseData.setConf(this.conf);
    return new ParseImpl(indexText.toString(), parseData);

You can see that the return from the parser, i.e., the ParseImpl, includes both the indexText and the parse data (which contains the title text).

Now, if you wanted to add any other metadata gleaned from the RSS to the title text or the content text, you can always modify the code to do that in your own environment. The RSS Parser plugin returns a full channel model and item model that can be extended and used for those purposes.

Hope that helps!

Cheers,
Chris
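To make the title/text split described above concrete, here is a minimal sketch. It only assumes what the message says: channel titles are concatenated into the title, and channel and item descriptions are concatenated into the index text. The class and field names (FeedTextSketch, Channel, Item, Parsed) are made up for illustration; they are not the plugin's real model classes, and the actual parser may join the pieces differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the plugin's channel and item models.
// These are NOT the real Nutch or commons-feedparser classes.
final class FeedTextSketch {

    static final class Item {
        final String title;
        final String description;

        Item(String title, String description) {
            this.title = title;
            this.description = description;
        }
    }

    static final class Channel {
        final String title;
        final String description;
        final List<Item> items;

        Channel(String title, String description, List<Item> items) {
            this.title = title;
            this.description = description;
            this.items = items;
        }
    }

    // The two strings that end up in the index: a title and a body text.
    static final class Parsed {
        final String title;
        final String text;

        Parsed(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    static Parsed collect(List<Channel> channels) {
        StringBuilder contentTitle = new StringBuilder();
        StringBuilder indexText = new StringBuilder();
        for (Channel channel : channels) {
            // Title portion: channel titles only.
            contentTitle.append(channel.title).append(' ');
            // Text portion: channel description, then every item description.
            indexText.append(channel.description).append(' ');
            for (Item item : channel.items) {
                indexText.append(item.description).append(' ');
            }
        }
        return new Parsed(contentTitle.toString().trim(), indexText.toString().trim());
    }

    public static void main(String[] args) {
        Channel blog = new Channel("My Blog", "channel words",
                Arrays.asList(new Item("First post", "item words")));
        Parsed parsed = collect(Arrays.asList(blog));
        System.out.println("title: " + parsed.title);
        System.out.println("text:  " + parsed.text);
    }
}
```

In this sketch the item descriptions never touch the title, yet they are still searchable because they are part of the text.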
Re: Which version of rss does parse-rss plugin support?
According to the code:

    theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));

I can see that the item description is also included. However, when I tried this feed:

    http://kgrimm.bravejournal.com/feed.rss

I only get the title and description of the channel, and searching for words from the item descriptions fails.

In the code above, the item description is paired with the outlink URL. Is it used as the contentTitle for that URL? When the outlink is fetched and parsed, I think new data about that URL will be generated.

On 06-2-11, Chris Mattmann [EMAIL PROTECTED] wrote:

> Hi,
>
> > the contentTitle will be a concatenation of the titles of the RSS Channels that we've parsed.
>
> > So the titles of the RSS Channels are what is delivered for indexing, right?
>
> They're certainly part of it, but not the only part. The concatenation of the titles of the RSS Channels is what is delivered for the title portion of indexing.
>
> > If I want the indexer to include more information about an RSS file (such as item descriptions), can I just concatenate them to the contentTitle?
>
> They're already there. There is a variable called indexText: ultimately that variable includes the item descriptions, along with the channel descriptions. That, along with the title portion of indexing, is the full set of textual data delivered by the parser for indexing. So it already includes that information. Check out lines 137 and 161 in the parser to see what I mean. Also, check out lines 204-207, which are:
>
>     ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS,
>         contentTitle.toString(), outlinks, content.getMetadata());
>     parseData.setConf(this.conf);
>     return new ParseImpl(indexText.toString(), parseData);
>
> You can see that the return from the parser, i.e., the ParseImpl, includes both the indexText and the parse data (which contains the title text).
>
> Now, if you wanted to add any other metadata gleaned from the RSS to the title text or the content text, you can always modify the code to do that in your own environment. The RSS Parser plugin returns a full channel model and item model that can be extended and used for those purposes.
>
> Hope that helps!
>
> Cheers,
> Chris
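A small sketch of what the quoted theOutlinks line does, for readers following the question above. It assumes the second argument to Outlink is the link's anchor text, which is my reading of the class and should be checked against your Nutch version. Link and Item here are hypothetical stand-ins, not Nutch classes, and the sketch does not settle why the search on the feed failed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OutlinkSketch {

    // Hypothetical stand-in for a parsed feed item.
    static final class Item {
        final String link;
        final String description;

        Item(String link, String description) {
            this.link = link;
            this.description = description;
        }
    }

    // Hypothetical stand-in for an outlink: a target URL plus anchor text.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Mirrors the quoted line: each item becomes one outlink whose
    // anchor text is the item's description.
    static List<Link> toOutlinks(List<Item> items) {
        List<Link> outlinks = new ArrayList<Link>();
        for (Item r : items) {
            outlinks.add(new Link(r.link, r.description));
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<Link> outlinks = toOutlinks(Arrays.asList(
                new Item("http://example.com/post/1", "first item description")));
        for (Link link : outlinks) {
            System.out.println(link.toUrl + " <- " + link.anchor);
        }
    }
}
```

Under that assumption, the description travels with the link as anchor text; it is a separate thing from the title that the target page gets when it is fetched and parsed itself.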
RE: Which version of rss does parse-rss plugin support?
Hi there,

That should work. However, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over.

Check out this previous post of mine on the list to get a better idea of what the real issue is:

http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html

G'luck!

Cheers,
Chris

Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

Jet Propulsion Laboratory, Pasadena, CA
Office: 171-266B, Mailstop: 171-246
Phone: 818-354-8810

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.

-----Original Message-----
From: 盖世豪侠 [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 04, 2006 11:40 PM
To: nutch-user@lucene.apache.org
Subject: Re: Which version of rss does parse-rss plugin support?

Hi Chris,

How do I change the plugin.xml? For example, if I want to crawl RSS files that end with xml, do I just add a new element?

    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="xml"/>

Am I right?
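The content-type problem discussed in this thread can be illustrated with a small sketch. It assumes a deliberately simplified rule: if the server sent a Content-Type header, only the declared content types count; if there is no header (a local file), the path suffix is used instead. This is NOT Nutch's actual parser-selection logic, and ParserMatchSketch, Declared, baseType and claims are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ParserMatchSketch {

    // One (contentType, pathSuffix) pair, as declared in a plugin.xml entry.
    static final class Declared {
        final String contentType;
        final String pathSuffix;

        Declared(String contentType, String pathSuffix) {
            this.contentType = contentType;
            this.pathSuffix = pathSuffix;
        }
    }

    // Strips parameters such as "; charset=UTF-8" and lower-cases the type.
    // Returns an empty string when no header was supplied.
    static String baseType(String header) {
        if (header == null) {
            return "";
        }
        int semi = header.indexOf(';');
        String type = semi >= 0 ? header.substring(0, semi) : header;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // Simplified rule (an assumption, see above): match on content type when
    // one is known, otherwise fall back to the path suffix.
    static boolean claims(List<Declared> declared, String contentTypeHeader, String path) {
        String type = baseType(contentTypeHeader);
        String lowerPath = path.toLowerCase(Locale.ROOT);
        for (Declared d : declared) {
            boolean byType = !type.isEmpty() && type.equals(d.contentType);
            boolean bySuffix = type.isEmpty() && lowerPath.endsWith("." + d.pathSuffix);
            if (byType || bySuffix) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Declared> declared = new ArrayList<Declared>();
        declared.add(new Declared("application/rss+xml", "rss"));

        // A server that labels its feed text/xml is not matched here,
        // even though the URL ends in .rss.
        System.out.println(claims(declared, "text/xml", "/feed.rss"));
        // A local file with no header is matched by suffix.
        System.out.println(claims(declared, null, "/tmp/feed.rss"));
    }
}
```

The point of the sketch is the first call in main: a correct suffix does not help when the server reports a content type that the plugin never declared.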
Re: Which version of rss does parse-rss plugin support?
Hi Chris,

Thank you for your post; I've read it through. So you mean that in most cases I should also add these lines to the plugin.xml:

    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="rss"/>
    ...
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="text/xml"
                    pathSuffix="xml"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="text/xml"
                    pathSuffix="rss"/>

On 06-2-6, Chris Mattmann [EMAIL PROTECTED] wrote:

> Hi there,
>
> That should work. However, the biggest problem will be making sure that text/xml is actually the content type of the RSS that you are parsing, which you'll have little or no control over.
>
> Check out this previous post of mine on the list to get a better idea of what the real issue is:
>
> http://www.nabble.com/Re:-Crawling-blogs-and-RSS-p1153844.html
>
> G'luck!
>
> Cheers,
> Chris
Re: Which version of rss does parse-rss plugin support?
Hi Chris,

How do I change the plugin.xml? For example, if I want to crawl RSS files that end with xml, do I just add a new element?

    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="rss"/>
    <implementation id="org.apache.nutch.parse.rss.RSSParser"
                    class="org.apache.nutch.parse.rss.RSSParser"
                    contentType="application/rss+xml"
                    pathSuffix="xml"/>

Am I right?

On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

> Hi there,
>
> Sure it will, you just have to configure it to do that. Pop over to $NUTCH_HOME/src/plugin/parse-rss/ and open up plugin.xml. In there, there is an attribute called pathSuffix. Change that to handle whatever type of RSS file you want to crawl. That will work locally. For web-based crawls, you need to make sure that the content type being returned for your RSS content matches the content type specified in the plugin.xml file that parse-rss claims to support.
>
> Note that you might not have *a lot* of success with being able to control the content type for RSS files returned by web servers. I've seen a LOT of inconsistency in the way that they're configured by the administrators, etc. However, just to let you know, there are some people in the group that are working on a solution to address this.
>
> Hope that helps.
>
> Cheers,
> Chris
>
> On 2/3/06 7:16 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:
>
> > Hi Chris,
> >
> > The files of RSS 1.0 have a postfix of rdf. So will the parser recognize it automatically as an RSS file?
Which version of rss does parse-rss plugin support?
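I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?

--
《盖世豪侠》 was so well received that it kept TVB's ratings high; TVB was pleased, yet still gave him no major roles. Stephen Chow (周星驰) was never going to stay in a small pond: once his comedic talent had shown itself, he would not accept being sidelined, so he moved into film and made his mark on the big screen. Having gained a prize horse only to lose it, TVB was left with regrets.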
Re: Which version of rss does parse-rss plugin support?
Hi there,

parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website:

"...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..."

Hope that helps.

Thanks,
Chris

On 2/3/06 6:46 AM, 盖世豪侠 [EMAIL PROTECTED] wrote:

> I see the test file is of version 0.91. Does the plugin support higher versions like 1.0 or 2.0?
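For readers wondering how the feed flavors listed above differ on the wire, here is a rough sketch that looks only at the root element of a document: rss (with a version attribute) for RSS 0.9x and 2.0, rdf:RDF for the RDF-based flavors such as RSS 1.0, and feed for Atom. It uses the JDK's built-in XML parser and is not how commons-feedparser does its detection, as far as I know; FeedFlavorSketch and flavor are invented names. Real feeds that carry a DOCTYPE may make the JDK parser try to fetch the DTD, which this sketch does not handle.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

final class FeedFlavorSketch {

    // Guesses the feed flavor from the document's root element only.
    static String flavor(String xml) {
        Element root;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            root = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        String name = root.getLocalName();
        if ("rss".equals(name)) {
            // RSS 0.91, 0.92 and 2.0 state their version on the root element.
            return "RSS " + root.getAttribute("version");
        }
        if ("RDF".equals(name)) {
            return "RDF-based RSS (such as 1.0)";
        }
        if ("feed".equals(name)) {
            return "Atom";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(flavor("<rss version=\"0.91\"><channel/></rss>"));
        System.out.println(flavor(
                "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>"));
    }
}
```

Note that nothing here depends on the file name, so a suffix such as rdf or xml says little about which flavor a file actually contains.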
Re: Which version of rss does parse-rss plugin support?
Hi Chris,

Files in RSS 1.0 format have the suffix rdf. Will the parser automatically recognize them as RSS files?

On 06-2-3, Chris Mattmann [EMAIL PROTECTED] wrote:

> Hi there,
>
> parse-rss is based on commons-feedparser (http://jakarta.apache.org/commons/sandbox/feedparser). From the feedparser website:
>
> "...commons-feedparser supports all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future versions) as well as easy ad hoc extension and RSS 1.0 modules capability..."
>
> Hope that helps.
>
> Thanks,
> Chris