I think I found it already documented; I just wasn't searching for the
right plugin:

https://issues.apache.org/jira/browse/NUTCH-1140

There is a patch there, I will try that.  Thanks for the help!
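
For anyone finding this thread later, here is a rough sketch of the behaviour described in the quoted messages below: a filename is pulled out of the Content-Disposition header and ends up as a second title value. It is not the index-more code. The regex, the function names and the dict-based document are all made up for illustration, and it only assumes the header looks like the example given in the plugin's comments.

```python
import re

# Hypothetical pattern (NOT the one index-more uses): captures the filename
# parameter of a Content-Disposition value, with or without double quotes.
FILENAME_RE = re.compile(r'filename\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    match = FILENAME_RE.search(header_value)
    return match.group(1).strip() if match else None


def add_title(doc, header_value):
    """Append the filename as an additional title value.

    Assumption: appending (rather than replacing) mirrors what indexchecker
    showed in this thread, where both titles were returned.
    """
    filename = filename_from_content_disposition(header_value)
    if filename:
        doc.setdefault("title", []).append(filename)
    return doc


# The document is modelled as a plain dict of field name -> list of values.
doc = {"title": ["Microsoft - Custom Search microsoft-job2web"]}
add_title(doc, 'inline; filename="jobexport.xml"')
print(doc["title"])
```

With a single-valued title field in the Solr schema, a document carrying both values is what gets rejected at solrindex time.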


On Mon, Feb 24, 2014 at 3:41 PM, Sebastian Nagel <[email protected]> wrote:

> Hi John,
>
> reproduced. It's the index-more plugin which adds the second title
> from Content-Disposition header field. If index-more is removed
> from plugin.includes the second title disappears:
>
> % bin/nutch indexchecker \
>      -Dplugin.includes="parse-tika|index-basic|protocol-http" \
>      http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
>
> Maybe that's an option for a quick work-around.
>
> You can also open an issue at https://issues.apache.org/jira/browse/Nutch.
> We'll check it. The authors of index-more explicitly add the
> Content-Disposition title (with the intention to overwrite?), cf. the
> code comments:
>
>   // Reset title if we see non-standard HTTP header "Content-Disposition".
>   // It's a good indication that content provider wants filename therein
>   // be used as the title of this url.
>
>   // Patterns used to extract filename from possible non-standard
>   // HTTP header "Content-Disposition". Typically it looks like:
>   // Content-Disposition: inline; filename="foo.ppt"
>
> Thanks,
> Sebastian
>
>
> On 02/24/2014 10:23 PM, John Lafitte wrote:
> > Here is an example of the feed:
> >
> > http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
> >
> > bin/nutch indexchecker \
> >   http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
> >
> > It returns:
> > title : Microsoft - Custom Search microsoft-job2web
> > title : jobexport.xml
> >
> >
> > On Mon, Feb 24, 2014 at 2:59 PM, John Lafitte <[email protected]> wrote:
> >
> >> I think the channel/image/title idea was probably wrong.  It looks like
> >> the extra title field is actually the HTTP header Content-Disposition:
> >> inline; filename="jobexport.xml".  I can email you the URL of the
> >> specific RSS feed I'm using for this issue privately, but since it's a
> >> client site I'm not sure I'm allowed to post it publicly.
> >>
> >> I'm using the default parser-plugins.xml, which lists parse-tika before
> >> feed.  I don't have feed in my plugin.includes, but if I modify
> >> parser-plugins.xml and plugin.includes to try to favor feed, I still
> >> get the same results.  I might be doing something wrong.
> >>
> >>
> >>
> >>
> >> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <
> >> [email protected]> wrote:
> >>
> >>> Hi John,
> >>>
> >>> can you attach a (short) example document to reproduce the problem?
> >>> I was not able to reproduce it with the example in
> >>> http://de.wikipedia.org/wiki/RSS
> >>> which contains channel/image/title.
> >>>
> >>> Which parser plugin is used: "feed" or "parse-tika"?
> >>> (If in doubt, please add the value of the property "plugin.includes".)
> >>>
> >>> Sebastian
> >>>
> >>>
> >>> On 02/24/2014 08:31 PM, John Lafitte wrote:
> >>>> I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem indexing
> >>>> RSS that has channel/title and then channel/image/title: it tries to
> >>>> add both of them, then fails when doing solrindex because title isn't
> >>>> multivalued.
> >>>>
> >>>> I've used nutch indexchecker and I see the two titles being returned.
> >>>> The extra title is the value that is in the Content-Disposition:
> >>>> filename HTTP header.  I only see one title when I run nutch readseg.
> >>>> So I'm a little confused why it's
> >>>>
> >>>> I have made title multivalued in the Solr schema and it seems to work
> >>>> that way, but it seems wrong to me.  Documents shouldn't have more
> >>>> than one title.  What is the correct way to fix this?
> >>>>
> >>>
> >>>
> >>
> >
>
>
