Here is an example of the feed: http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
bin/nutch indexchecker http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

It returns:

title : Microsoft - Custom Search microsoft-job2web
title : jobexport.xml

On Mon, Feb 24, 2014 at 2:59 PM, John Lafitte <[email protected]> wrote:

> I think the channel/image/title idea was probably wrong. It looks like
> the extra title field is actually the http header Content-Disposition:
> inline; filename="jobexport.xml". I can email you the url privately of the
> specific RSS feed I'm using for this issue, but since it's a client site
> I'm not sure I'm allowed to post it publicly.
>
> I'm using the default parser-plugins.xml which shows parse-tika before
> feed. I don't have feed in my plugin.includes, but if I modify
> parser-plugins.xml and plugin.includes to try to favor the feed I still get
> the same results. I might be doing something wrong.
>
> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi John,
>>
>> can you attach an (short) example document to reproduce the problem?
>> I was not able to reproduce it with the example in
>> http://de.wikipedia.org/wiki/RSS
>> which contains channel/image/title.
>>
>> Which parser plugin is used: "feed" or "parse-tika"?
>> (In doubt, please, add the value of property "plugin.includes")
>>
>> Sebastian
>>
>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>> > I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
>> > RSS that has channel/title then channel/image/title it tries to add both of
>> > them then fails when doing solrindex because title isn't multivalued.
>> >
>> > I've used nutch indexchecker and I see the two titles being returned. The
>> > extra title is the value that in the content-disposition: filename http
>> > header. I only see one title when I run nutch readseg. So I'm a little
>> > confused why it's
>> >
>> > I have made title multivalued in the solr schema and it seems to work that
>> > way, but it seems wrong to me. Documents shouldn't have more than one
>> > title. What is the correct way to fix this?
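
To make the diagnosis above concrete, here is a minimal standalone sketch (plain Python, not Nutch or Solr code) of how the second title can be matched against the filename in the Content-Disposition header and dropped. The function names are hypothetical and only for illustration; the header value and the two titles are the ones quoted in this thread. It assumes the extra title is exactly the header's filename parameter.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition header, or None."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def drop_filename_title(titles, header_value):
    """Remove any title that is just the Content-Disposition filename.

    Hypothetical helper: if every title would be removed, the original
    list is returned so the document is never left without a title.
    """
    filename = filename_from_content_disposition(header_value)
    kept = [t for t in titles if t != filename]
    return kept or titles


# Values taken from the indexchecker output and header quoted in this thread.
titles = ["Microsoft - Custom Search microsoft-job2web", "jobexport.xml"]
header = 'inline; filename="jobexport.xml"'

print(filename_from_content_disposition(header))
print(drop_filename_title(titles, header))
```

This only illustrates the matching logic. Where such a check would belong in an actual Nutch setup (parser choice, an indexing filter, or elsewhere) is not settled by anything in this thread.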

