Re: multivalues returned unexpectedly

Sebastian Nagel Mon, 24 Feb 2014 14:02:47 -0800

> https://issues.apache.org/jira/browse/NUTCH-1140
Thanks for digging this up!


> Why is index-more adding this?
Maybe, to have some title for MIME types
which have no title (e.g., plain text).
That could be the intention.
The code is old (> 9 years) and the web
has changed since. The original
RFC http://www.ietf.org/rfc/rfc1806.txt
for the content-disposition header
is even older (1995).
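
To illustrate how a second title value can end up in the document,
here is a minimal Python sketch. It is not the index-more code: the
regular expression, the function names, and the "Job Export Feed"
title are assumptions made up for illustration. Only the header value
is taken from John's report.

```python
import re

# Hypothetical pattern for illustration only; not copied from index-more.
FILENAME_RE = re.compile(r'filename\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def filename_from_content_disposition(header):
    """Return the filename parameter of a Content-Disposition value, or None."""
    match = FILENAME_RE.search(header or "")
    return match.group(1).strip() if match else None


def add_title(doc, header):
    """Append the header's filename as another 'title' value (assumed behavior)."""
    name = filename_from_content_disposition(header)
    if name is not None:
        # The field already holds the parsed title, so it becomes multi-valued.
        doc.setdefault("title", []).append(name)
    return doc


# "Job Export Feed" is a made-up title standing in for channel/title.
doc = {"title": ["Job Export Feed"]}
add_title(doc, 'inline; filename="jobexport.xml"')
print(doc["title"])
```

If the filter only added the filename when no title was present, the
field would stay single-valued.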


On 02/24/2014 10:40 PM, John Lafitte wrote:
> Okay, I invoked it the way you mentioned and I get the same result.
>  However, I tried it without index-more included and I no longer have the
> additional title.  Why is index-more adding this?
> 
> 
> On Mon, Feb 24, 2014 at 3:24 PM, Sebastian Nagel <[email protected]> wrote:
> 
>>> I'm not sure I'm allowed to post it publicly.
>> A minimalistic and anonymized example would be fine.
>> However, if it's really the HTTP header it will
>> be hard to make it reproducible.
>>
>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>> feed.  I don't have feed in my plugin.includes, but if I modify
>>> parser-plugins.xml and plugin.includes to try to favor the feed I still
>>> get the same results.  I might be doing something wrong.
>>
>> It's possible to set plugin.includes (and other properties) just for
>> tools like indexchecker, parsechecker, etc:
>>
>> % bin/nutch indexchecker
>> -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
>>
>>
>> On 02/24/2014 09:59 PM, John Lafitte wrote:
>>> I think the channel/image/title idea was probably wrong.  It looks like
>>> the extra title field is actually the http header Content-Disposition:
>>> inline; filename="jobexport.xml".  I can email you the url privately of the
>>> specific RSS feed I'm using for this issue, but since it's a client site
>>> I'm not sure I'm allowed to post it publicly.
>>>
>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>> feed.  I don't have feed in my plugin.includes, but if I modify
>>> parser-plugins.xml and plugin.includes to try to favor the feed I still
>>> get the same results.  I might be doing something wrong.
>>>
>>>
>>>
>>>
>>> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
>>>
>>>> Hi John,
>>>>
>>>> can you attach a (short) example document to reproduce the problem?
>>>> I was not able to reproduce it with the example in
>>>> http://de.wikipedia.org/wiki/RSS
>>>> which contains channel/image/title.
>>>>
>>>> Which parser plugin is used: "feed" or "parse-tika"?
>>>> (If in doubt, please add the value of property "plugin.includes")
>>>>
>>>> Sebastian
>>>>
>>>>
>>>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>>>> I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem with
>>>>> indexing RSS that has channel/title then channel/image/title: it tries
>>>>> to add both of them, then fails when doing solrindex because title
>>>>> isn't multivalued.
>>>>>
>>>>> I've used nutch indexchecker and I see the two titles being returned.
>>>>> The extra title is the value that is in the content-disposition:
>>>>> filename http header.  I only see one title when I run nutch readseg.
>>>>> So I'm a little confused why it's
>>>>>
>>>>> I have made title multivalued in the solr schema and it seems to work
>>>>> that way, but it seems wrong to me.  Documents shouldn't have more than one
>>>>> title.  What is the correct way to fix this?
>>>>>
>>>>
>>>>
>>>
>>
>>
>
