Re: multivalues returned unexpectedly

Sebastian Nagel Mon, 24 Feb 2014 13:43:00 -0800

Hi John,

Reproduced. It's the index-more plugin that adds the second title
from the Content-Disposition header field. If index-more is removed
from plugin.includes, the second title disappears:


% bin/nutch indexchecker \
     -Dplugin.includes="parse-tika|index-basic|protocol-http" \
     http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

Maybe that's an option for a quick work-around.

You can also open an issue at https://issues.apache.org/jira/browse/Nutch.
We'll check it. The authors of index-more explicitly add the
Content-Disposition title (with the intention to overwrite?), cf. the
code comments:

  // Reset title if we see non-standard HTTP header "Content-Disposition".
  // It's a good indication that content provider wants filename therein
  // be used as the title of this url.

  // Patterns used to extract filename from possible non-standard
  // HTTP header "Content-Disposition". Typically it looks like:
  // Content-Disposition: inline; filename="foo.ppt"
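
To illustrate the difference between adding and resetting the title, here is a minimal Java sketch. It is not the index-more code: the class name, the regular expression, and the Map-based stand-in for an index document are all assumptions made up for this example. It extracts the filename from a Content-Disposition value and shows that adding it yields two title values, while resetting leaves only one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentDispositionTitle {

    // Hypothetical pattern (not the one used by index-more): matches
    // filename="foo.ppt" as well as the unquoted form filename=foo.ppt.
    private static final Pattern FILENAME = Pattern.compile(
        "\\bfilename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
        Pattern.CASE_INSENSITIVE);

    // Returns the filename parameter of a Content-Disposition value,
    // or null if the header is missing or carries no filename.
    static String extractFilename(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = FILENAME.matcher(headerValue);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    // "Add" semantics: appends a value, so the field may become multi-valued.
    static void addValue(Map<String, List<String>> doc, String field, String value) {
        doc.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    // "Reset" semantics: discards earlier values, so the field stays single-valued.
    static void setValue(Map<String, List<String>> doc, String field, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        doc.put(field, values);
    }

    public static void main(String[] args) {
        String fromHeader = extractFilename("inline; filename=\"jobexport.xml\"");

        // Plain map standing in for an index document (assumption for this sketch).
        Map<String, List<String>> doc = new LinkedHashMap<>();
        addValue(doc, "title", "Microsoft - Custom Search microsoft-job2web");
        addValue(doc, "title", fromHeader);
        System.out.println("after add:   " + doc.get("title"));

        setValue(doc, "title", fromHeader);
        System.out.println("after reset: " + doc.get("title"));
    }
}
```

With "add" the document ends up with both titles, which is what indexchecker reports; with "reset" only the Content-Disposition filename would remain.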

Thanks,
Sebastian


On 02/24/2014 10:23 PM, John Lafitte wrote:
> Here is an example of the feed:
> 
> http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
> 
> bin/nutch indexchecker
> http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS
> 
> It returns:
> title : Microsoft - Custom Search microsoft-job2web
> title : jobexport.xml
> 
> 
> On Mon, Feb 24, 2014 at 2:59 PM, John Lafitte <[email protected]> wrote:
> 
>> I think the channel/image/title idea was probably wrong.  It looks like
>> the extra title field is actually the http header Content-Disposition:
>> inline; filename="jobexport.xml".  I can email you the url privately of the
>> specific RSS feed I'm using for this issue, but since it's a client site
>> I'm not sure I'm allowed to post it publicly.
>>
>> I'm using the default parser-plugins.xml which shows parse-tika before
>> feed.  I don't have feed in my plugin.includes, but if I modify
>> parser-plugins.xml and plugin.includes to try to favor the feed I still get
>> the same results.  I might be doing something wrong.
>>
>>
>>
>>
>> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <
>> [email protected]> wrote:
>>
>>> Hi John,
>>>
>>> can you attach a (short) example document to reproduce the problem?
>>> I was not able to reproduce it with the example in
>>> http://de.wikipedia.org/wiki/RSS
>>> which contains channel/image/title.
>>>
>>> Which parser plugin is used: "feed" or "parse-tika"?
>>> (If in doubt, please add the value of property "plugin.includes")
>>>
>>> Sebastian
>>>
>>>
>>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>>> I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem with indexing
>>>> RSS that has channel/title and then channel/image/title: it tries to add
>>>> both of them, then fails when doing solrindex because title isn't
>>>> multivalued.
>>>>
>>>> I've used nutch indexchecker and I see the two titles being returned.
>>>> The extra title is the value that is in the content-disposition: filename
>>>> http header.  I only see one title when I run nutch readseg.  So I'm a
>>>> little confused why it's
>>>>
>>>> I have made title multivalued in the solr schema and it seems to work
>>>> that way, but it seems wrong to me.  Documents shouldn't have more than
>>>> one title.  What is the correct way to fix this?
>>>>
>>>
>>>
>>
>
