> https://issues.apache.org/jira/browse/NUTCH-1140

Thanks for digging this up!

> Why is index-more adding this?

Maybe, to have some title for MIME types which have no title
(e.g., plain text). That could be the intention. The code is old
(> 9 years) and the web has changed since. The original RFC
http://www.ietf.org/rfc/rfc1806.txt
for the content-disposition header is even older (1995).

On 02/24/2014 10:40 PM, John Lafitte wrote:
> Okay, I invoked it the way you mentioned and I get the same result.
> However, I tried it without index-more included and I no longer have the
> additional title. Why is index-more adding this?
>
>
> On Mon, Feb 24, 2014 at 3:24 PM, Sebastian Nagel <[email protected]> wrote:
>
>>> I'm not sure I'm allowed to post it publicly.
>> A minimalistic and anonymized example would be fine.
>> However, if it's really the HTTP header it will
>> be hard to make it reproducible.
>>
>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>> feed. I don't have feed in my plugin.includes, but if I modify
>>> parser-plugins.xml and plugin.includes to try to favor the feed I still get
>>> the same results. I might be doing something wrong.
>>
>> It's possible to set plugin.includes (and other properties) just for
>> tools like indexchecker, parsechecker, etc:
>>
>> % bin/nutch indexchecker -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
>>
>>
>> On 02/24/2014 09:59 PM, John Lafitte wrote:
>>> I think the channel/image/title idea was probably wrong. It looks like the
>>> extra title field is actually the http header Content-Disposition: inline;
>>> filename="jobexport.xml". I can email you the url privately of the
>>> specific RSS feed I'm using for this issue, but since it's a client site
>>> I'm not sure I'm allowed to post it publicly.
>>>
>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>> feed. I don't have feed in my plugin.includes, but if I modify
>>> parser-plugins.xml and plugin.includes to try to favor the feed I still get
>>> the same results. I might be doing something wrong.
>>>
>>>
>>> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
>>>
>>>> Hi John,
>>>>
>>>> can you attach a (short) example document to reproduce the problem?
>>>> I was not able to reproduce it with the example in
>>>> http://de.wikipedia.org/wiki/RSS
>>>> which contains channel/image/title.
>>>>
>>>> Which parser plugin is used: "feed" or "parse-tika"?
>>>> (In doubt, please, add the value of property "plugin.includes")
>>>>
>>>> Sebastian
>>>>
>>>>
>>>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>>>> I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
>>>>> RSS that has channel/title then channel/image/title it tries to add both of
>>>>> them then fails when doing solrindex because title isn't multivalued.
>>>>>
>>>>> I've used nutch indexchecker and I see the two titles being returned. The
>>>>> extra title is the value that is in the content-disposition: filename http
>>>>> header. I only see one title when I run nutch readseg. So I'm a little
>>>>> confused why it's
>>>>>
>>>>> I have made title multivalued in the solr schema and it seems to work that
>>>>> way, but it seems wrong to me. Documents shouldn't have more than one
>>>>> title. What is the correct way to fix this?
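
For readers following the thread, here is a rough Python sketch of the mechanism discussed above: a title taken from the parsed feed plus a second "title" taken from the filename in the Content-Disposition header. This is not Nutch code. The function names and the parsed title "Job feed" are made up for illustration; only the header value comes from the thread.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def collect_titles(parsed_title, http_headers):
    """Hypothetical stand-in for the indexing step described in the thread:
    one title from the parser, plus one from the header filename if present."""
    titles = []
    if parsed_title:
        titles.append(parsed_title)
    filename = filename_from_content_disposition(
        http_headers.get("Content-Disposition"))
    if filename:
        titles.append(filename)
    return titles


# Header value as reported in the thread; "Job feed" is an assumed parsed title.
headers = {"Content-Disposition": 'inline; filename="jobexport.xml"'}
titles = collect_titles("Job feed", headers)
print(titles)
```

With the header present the list holds two values, which is the situation a single-valued title field cannot accept. Without the header only the parsed title remains, matching the observation that the extra title disappears when index-more is left out.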

