Here is an example of the feed:

http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

bin/nutch indexchecker http://www.microsoft-careers.com/feeds/microsoft%20job2web/?src=RSS

It returns:
title : Microsoft - Custom Search microsoft-job2web
title : jobexport.xml
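
To spot this kind of duplicate quickly, the indexchecker output can be grouped by field name. This is only a rough sketch: it assumes the output is plain "name : value" lines as pasted above, and the helper names (group_fields, repeated_fields) are made up for illustration. They are not part of Nutch.

```python
from collections import defaultdict


def group_fields(output: str) -> dict[str, list[str]]:
    """Group 'name : value' lines by field name, keeping value order.

    Assumes each field is printed on its own line with a colon after
    the field name, as in the indexchecker output quoted above.
    """
    fields = defaultdict(list)
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            continue  # skip blank lines and lines without a separator
        fields[name.strip()].append(value.strip())
    return dict(fields)


def repeated_fields(fields: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return only the fields that carry more than one value."""
    return {name: values for name, values in fields.items() if len(values) > 1}


sample = """\
title : Microsoft - Custom Search microsoft-job2web
title : jobexport.xml
"""

for name, values in repeated_fields(group_fields(sample)).items():
    print(name, values)
```

Any field reported here would be rejected by Solr unless it is declared multivalued in the schema, so it shows which fields need attention before running solrindex.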


On Mon, Feb 24, 2014 at 2:59 PM, John Lafitte <[email protected]> wrote:

> I think the channel/image/title idea was probably wrong.  It looks like
> the extra title field is actually the HTTP header Content-Disposition:
> inline; filename="jobexport.xml".  I can privately email you the URL of the
> specific RSS feed I'm using for this issue, but since it's a client site
> I'm not sure I'm allowed to post it publicly.
>
> I'm using the default parser-plugins.xml, which lists parse-tika before
> feed.  I don't have feed in my plugin.includes, but if I modify
> parser-plugins.xml and plugin.includes to try to favor the feed plugin, I
> still get the same results.  I might be doing something wrong.
>
>
>
>
> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi John,
>>
>> can you attach a (short) example document to reproduce the problem?
>> I was not able to reproduce it with the example in
>> http://de.wikipedia.org/wiki/RSS
>> which contains channel/image/title.
>>
>> Which parser plugin is used: "feed" or "parse-tika"?
>> (If in doubt, please add the value of property "plugin.includes")
>>
>> Sebastian
>>
>>
>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>> > I am using Nutch 1.7 and Solr 4.6.1.  I'm having a problem with indexing
>> > RSS that has channel/title and then channel/image/title: it tries to add
>> > both of them, then fails when doing solrindex because title isn't
>> > multivalued.
>> >
>> > I've used nutch indexchecker and I see the two titles being returned.  The
>> > extra title is the value that is in the Content-Disposition: filename HTTP
>> > header.  I only see one title when I run nutch readseg.  So I'm a little
>> > confused why it's
>> >
>> > I have made title multivalued in the solr schema and it seems to work that
>> > way, but it seems wrong to me.  Documents shouldn't have more than one
>> > title.  What is the correct way to fix this?
>> >
>>
>>
>
