Okay, I invoked it the way you mentioned and I get the same result. However, I tried it without index-more included and I no longer have the additional title. Why is index-more adding this?
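
The comparison can be sketched as two indexchecker invocations that differ only in whether index-more is listed. The plugin lists below are assumptions taken from the quoted command, not necessarily the actual configuration, and the feed URL stays elided as in the quoted message. The commands are only printed here, not executed.

```shell
# Assumed plugin sets, modeled on the quoted command; only index-more differs.
WITH_MORE='feed|index-(basic|more)|protocol-http'
WITHOUT_MORE='feed|index-basic|protocol-http'
FEED_URL='.../rss.xml'  # elided, as in the quoted message

# Print both commands so the two runs can be compared side by side.
for includes in "$WITH_MORE" "$WITHOUT_MORE"; do
  echo "bin/nutch indexchecker -Dplugin.includes=\"$includes\" $FEED_URL"
done
```

Running each printed command and comparing the reported title fields should show whether the extra title only appears when index-more is active.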

On Mon, Feb 24, 2014 at 3:24 PM, Sebastian Nagel <[email protected]> wrote:

> > I'm not sure I'm allowed to post it publicly.
> A minimalistic and anonymized example would be fine.
> However, if it's really the HTTP header it will
> be hard to make it reproducible.
>
> > I'm using the default parser-plugins.xml which shows parse-tika before
> > feed. I don't have feed in my plugin.includes, but if I modify
> > parser-plugins.xml and plugin.includes to try to favor the feed I still get
> > the same results. I might be doing something wrong.
>
> It's possible to set plugin.includes (and other properties) just for
> tools like indexchecker, parsechecker, etc:
>
> % bin/nutch indexchecker -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
>
>
> On 02/24/2014 09:59 PM, John Lafitte wrote:
> > I think the channel/image/title idea was probably wrong. It looks like the
> > extra title field is actually the http header Content-Disposition: inline;
> > filename="jobexport.xml". I can email you the url privately of the
> > specific RSS feed I'm using for this issue, but since it's a client site
> > I'm not sure I'm allowed to post it publicly.
> >
> > I'm using the default parser-plugins.xml which shows parse-tika before
> > feed. I don't have feed in my plugin.includes, but if I modify
> > parser-plugins.xml and plugin.includes to try to favor the feed I still get
> > the same results. I might be doing something wrong.
> >
> >
> > On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
> >
> >> Hi John,
> >>
> >> can you attach an (short) example document to reproduce the problem?
> >> I was not able to reproduce it with the example in
> >> http://de.wikipedia.org/wiki/RSS
> >> which contains channel/image/title.
> >>
> >> Which parser plugin is used: "feed" or "parse-tika"?
> >> (In doubt, please, add the value of property "plugin.includes")
> >>
> >> Sebastian
> >>
> >>
> >> On 02/24/2014 08:31 PM, John Lafitte wrote:
> >>> I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
> >>> RSS that has channel/title then channel/image/title it tries to add both of
> >>> them then fails when doing solrindex because title isn't multivalued.
> >>>
> >>> I've used nutch indexchecker and I see the two titles being returned. The
> >>> extra title is the value that in the content-disposition: filename http
> >>> header. I only see one title when I run nutch readseg. So I'm a little
> >>> confused why it's
> >>>
> >>> I have made title multivalued in the solr schema and it seems to work that
> >>> way, but it seems wrong to me. Documents shouldn't have more than one
> >>> title. What is the correct way to fix this?

