I think I found it already documented, I just wasn't searching for the right plugin:
https://issues.apache.org/jira/browse/NUTCH-1140

There is a patch there, I will try that. Thanks for the help!

On Mon, Feb 24, 2014 at 3:41 PM, Sebastian Nagel <[email protected]> wrote:

> Hi John,
>
> reproduced. It's the index-more plugin which adds the second title
> from the Content-Disposition header field. If index-more is removed
> from plugin.includes the second title disappears:
>
> % bin/nutch indexchecker -Dplugin.includes="parse-tika|index-basic|protocol-http" \
>     http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
>
> Maybe that's an option for a quick work-around.
>
> You can also open an issue at https://issues.apache.org/jira/browse/Nutch.
> We'll check it. The authors of index-more explicitly add (with the
> intention to overwrite?) the content-disposition title, cf. code comments:
>
> // Reset title if we see non-standard HTTP header "Content-Disposition".
> // It's a good indication that content provider wants filename therein
> // be used as the title of this url.
>
> // Patterns used to extract filename from possible non-standard
> // HTTP header "Content-Disposition". Typically it looks like:
> // Content-Disposition: inline; filename="foo.ppt"
>
> Thanks,
> Sebastian
>
>
> On 02/24/2014 10:23 PM, John Lafitte wrote:
> > Here is an example of the feed:
> >
> > http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
> >
> > bin/nutch indexchecker
> > http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
> >
> > It returns:
> > title : Microsoft - Custom Search microsoft-job2web
> > title : jobexport.xml
> >
> >
> > On Mon, Feb 24, 2014 at 2:59 PM, John Lafitte <[email protected]> wrote:
> >
> >> I think the channel/image/title idea was probably wrong. It looks like
> >> the extra title field is actually the HTTP header Content-Disposition:
> >> inline; filename="jobexport.xml". I can email you the URL of the
> >> specific RSS feed I'm using for this issue privately, but since it's a
> >> client site I'm not sure I'm allowed to post it publicly.
> >>
> >> I'm using the default parser-plugins.xml, which lists parse-tika before
> >> feed. I don't have feed in my plugin.includes, but if I modify
> >> parser-plugins.xml and plugin.includes to try to favor feed I still get
> >> the same results. I might be doing something wrong.
> >>
> >>
> >> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
> >>
> >>> Hi John,
> >>>
> >>> can you attach a (short) example document to reproduce the problem?
> >>> I was not able to reproduce it with the example in
> >>> http://de.wikipedia.org/wiki/RSS
> >>> which contains channel/image/title.
> >>>
> >>> Which parser plugin is used: "feed" or "parse-tika"?
> >>> (If in doubt, please add the value of property "plugin.includes")
> >>>
> >>> Sebastian
> >>>
> >>>
> >>> On 02/24/2014 08:31 PM, John Lafitte wrote:
> >>>> I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
> >>>> RSS that has channel/title then channel/image/title: it tries to add
> >>>> both of them, then fails when doing solrindex because title isn't
> >>>> multivalued.
> >>>>
> >>>> I've used nutch indexchecker and I see the two titles being returned.
> >>>> The extra title is the value that is in the content-disposition:
> >>>> filename HTTP header. I only see one title when I run nutch readseg.
> >>>> So I'm a little confused why it's
> >>>>
> >>>> I have made title multivalued in the Solr schema and it seems to work
> >>>> that way, but it seems wrong to me. Documents shouldn't have more than
> >>>> one title. What is the correct way to fix this?
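
For anyone reading this thread later, here is a rough Python sketch of the behaviour described above: a filename is pulled out of the Content-Disposition header and added as a second title value next to the parsed one. This is not the index-more source (that plugin is Java). The regular expression and the function names `filename_from_content_disposition`, `add_title` and `index_titles` are made up for illustration, and the dict only stands in for an index document.

```python
import re

# Hypothetical pattern for illustration only; it is NOT copied from index-more.
# It matches headers such as: inline; filename="foo.ppt"
FILENAME_RE = re.compile(r'filename\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def filename_from_content_disposition(header):
    """Return the filename in a Content-Disposition value, or None."""
    if not header:
        return None
    match = FILENAME_RE.search(header)
    return match.group(1).strip() if match else None


def add_title(doc, value):
    """Append a value to the 'title' field instead of replacing it."""
    doc.setdefault("title", []).append(value)


def index_titles(parsed_title, content_disposition):
    """Build a stand-in index document from the parsed title and the header."""
    doc = {}
    add_title(doc, parsed_title)
    filename = filename_from_content_disposition(content_disposition)
    if filename:
        # Adding rather than overwriting is what leaves two title values.
        add_title(doc, filename)
    return doc


doc = index_titles(
    "Microsoft - Custom Search microsoft-job2web",
    'inline; filename="jobexport.xml"',
)
print(doc["title"])
```

With the values from the thread, the document ends up with two title entries, which is the situation a single-valued Solr title field rejects. Passing no header leaves only the parsed title.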

