Re: multivalues returned unexpectedly

Sebastian Nagel Mon, 24 Feb 2014 13:25:36 -0800

> I'm not sure I'm allowed to post it publicly.
A minimal, anonymized example would be fine.
However, if it's really the HTTP header, it will
be hard to make it reproducible.


> I'm using the default parser-plugins.xml which shows parse-tika before
> feed.  I don't have feed in my plugin.includes, but if I modify
> parser-plugins.xml and plugin.includes to try to favor the feed I still get
> the same results.  I might be doing something wrong.

It's possible to override plugin.includes (and other properties) on the
command line, just for a single run of tools like indexchecker,
parsechecker, etc.:

% bin/nutch indexchecker \
    -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
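
To compare the two parsers side by side, the same override can be
generated once per parser plugin. The following is only a sketch: the
helper name checker_cmd and the URL in FEED_URL are made up for
illustration (they are not part of Nutch), and it assumes parse-tika
can be swapped in for feed in the same plugin.includes pattern. It
prints the commands rather than running them, so they can be inspected
first.

```shell
# Placeholder URL (assumption): replace with the real feed URL.
FEED_URL="http://example.com/rss.xml"

# Hypothetical helper: print the checker command for one parser plugin.
#   $1 = tool name (indexchecker or parsechecker)
#   $2 = parser plugin to enable (feed or parse-tika)
checker_cmd() {
  printf 'bin/nutch %s -Dplugin.includes="%s|index-(basic|more)|protocol-http" %s\n' \
    "$1" "$2" "$FEED_URL"
}

# Print one command per parser so the outputs can be compared.
checker_cmd indexchecker feed
checker_cmd indexchecker parse-tika
```

Running each printed command from the Nutch runtime directory and
comparing the title fields should show whether the extra title
depends on the parser that is active.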


On 02/24/2014 09:59 PM, John Lafitte wrote:
> I think the channel/image/title idea was probably wrong.  It looks like the
> extra title field is actually the http header Content-Disposition: inline;
> filename="jobexport.xml".  I can email you the url privately of the
> specific RSS feed I'm using for this issue, but since it's a client site
> I'm not sure I'm allowed to post it publicly.
> 
> I'm using the default parser-plugins.xml which shows parse-tika before
> feed.  I don't have feed in my plugin.includes, but if I modify
> parser-plugins.xml and plugin.includes to try to favor the feed I still get
> the same results.  I might be doing something wrong.
> 
> 
> 
> 
> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
> 
>> Hi John,
>>
>> can you attach a (short) example document to reproduce the problem?
>> I was not able to reproduce it with the example in
>> http://de.wikipedia.org/wiki/RSS
>> which contains channel/image/title.
>>
>> Which parser plugin is used: "feed" or "parse-tika"?
>> (If in doubt, please add the value of property "plugin.includes")
>>
>> Sebastian
>>
>>
>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>> I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem with indexing
>>> RSS that has channel/title then channel/image/title it tries to add both of
>>> them then fails when doing solrindex because title isn't multivalued.
>>>
>>> I've used nutch indexchecker and I see the two titles being returned.  The
>>> extra title is the value that is in the content-disposition: filename http
>>> header.  I only see one title when I run nutch readseg.  So I'm a little
>>> confused why it's
>>>
>>> I have made title multivalued in the solr schema and it seems to work that
>>> way, but it seems wrong to me.  Documents shouldn't have more than one
>>> title.  What is the correct way to fix this?
>>>
>>
>>
>
