Hi John,
Reproduced. It's the index-more plugin that adds the second title,
taken from the Content-Disposition header field. If index-more is
removed from plugin.includes, the second title disappears:
% bin/nutch indexchecker
-Dplugin.includes="parse-tika|index-basic|protocol-http" \
http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
Maybe that's an option for a quick work-around.
You can also open an issue at https://issues.apache.org/jira/browse/Nutch.
We'll check it. The authors of index-more explicitly add the
Content-Disposition title (with the intention to overwrite?),
cf. the code comments:
// Reset title if we see non-standard HTTP header "Content-Disposition".
// It's a good indication that content provider wants filename therein
// be used as the title of this url.
// Patterns used to extract filename from possible non-standard
// HTTP header "Content-Disposition". Typically it looks like:
// Content-Disposition: inline; filename="foo.ppt"
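
For illustration only, here is a minimal sketch of how a filename can be
pulled out of such a header value. It is not the index-more source: the
class name ContentDispositionFilename, the method extract, and the regular
expression are my own assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; NOT the actual index-more code.
class ContentDispositionFilename {

  // Assumed pattern: matches filename="foo.ppt" as well as the unquoted
  // form filename=foo.ppt, ignoring case.
  private static final Pattern FILENAME = Pattern.compile(
      "\\bfilename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
      Pattern.CASE_INSENSITIVE);

  // Returns the filename found in a Content-Disposition header value,
  // or null if the value is null or carries no filename parameter.
  static String extract(String headerValue) {
    if (headerValue == null) {
      return null;
    }
    Matcher m = FILENAME.matcher(headerValue);
    if (!m.find()) {
      return null;
    }
    // Group 1 is the quoted form, group 2 the unquoted form.
    return m.group(1) != null ? m.group(1) : m.group(2);
  }

  public static void main(String[] args) {
    // Header value as reported for the feed in this thread.
    System.out.println(extract("inline; filename=\"jobexport.xml\""));
  }
}
```

With the header value from your feed this yields jobexport.xml, which is
the second title you see in the indexchecker output.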
Thanks,
Sebastian
On 02/24/2014 10:23 PM, John Lafitte wrote:
> Here is an example of the feed:
>
> http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
>
> bin/nutch indexchecker
> http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
>
> It returns:
> title : Microsoft - Custom Search microsoft-job2web
> title : jobexport.xml
>
>
> On Mon, Feb 24, 2014 at 2:59 PM, John Lafitte
> <[email protected]>wrote:
>
>> I think the channel/image/title idea was probably wrong. It looks like
>> the extra title field is actually the http header Content-Disposition:
>> inline; filename="jobexport.xml". I can email you the url privately of the
>> specific RSS feed I'm using for this issue, but since it's a client site
>> I'm not sure I'm allowed to post it publicly.
>>
>> I'm using the default parser-plugins.xml which shows parse-tika before
>> feed. I don't have feed in my plugin.includes, but if I modify
>> parser-plugins.xml and plugin.includes to try to favor the feed I still get
>> the same results. I might be doing something wrong.
>>
>>
>>
>>
>> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <
>> [email protected]> wrote:
>>
>>> Hi John,
>>>
>>> can you attach a (short) example document to reproduce the problem?
>>> I was not able to reproduce it with the example in
>>> http://de.wikipedia.org/wiki/RSS
>>> which contains channel/image/title.
>>>
>>> Which parser plugin is used: "feed" or "parse-tika"?
>>> (If in doubt, please add the value of the property "plugin.includes".)
>>>
>>> Sebastian
>>>
>>>
>>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>>> I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
>>>> RSS that has channel/title then channel/image/title: it tries to add both
>>>> of them, then fails when doing solrindex because title isn't multivalued.
>>>>
>>>> I've used nutch indexchecker and I see the two titles being returned. The
>>>> extra title is the value that is in the content-disposition: filename http
>>>> header. I only see one title when I run nutch readseg. So I'm a little
>>>> confused why it's
>>>>
>>>> I have made title multivalued in the solr schema and it seems to work that
>>>> way, but it seems wrong to me. Documents shouldn't have more than one
>>>> title. What is the correct way to fix this?
>>>>
>>>
>>>
>>
>