Re: publishedDate and feed plugin

Lewis John Mcgibbney Fri, 08 Jun 2012 06:19:22 -0700

Hi,

No, this should not be necessary. The feed parser and its accompanying
indexing filter should extract and send (to be indexed) the following
metadata items:
Author, Tags, Published date, Updated date and Feed.
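
If you do want all five of these in Solr, the mapping entries would look
something like the following. This is only a sketch: I am assuming the
field names emitted by the indexing filter match the ones in the schema
you posted (author, tag, feed, publishedDate, updatedDate), and that the
entries go inside the existing <fields> element of your solrindex-mapping
file.

```xml
<!-- Assumed field names, taken from the schema.xml quoted in this thread -->
<field dest="author" source="author"/>
<field dest="tag" source="tag"/>
<field dest="feed" source="feed"/>
<field dest="publishedDate" source="publishedDate"/>
<field dest="updatedDate" source="updatedDate"/>
```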


There is a problem though...

With many feeds, including the bbci one you provided in another
thread, many of these fields are absent. The parser and indexing
filter then have nothing to work with, so these fields are simply
left out.

It is also important to note that in parse-plugins.xml we first try to
parse the application/rss+xml mimetype with parse-tika before feed.
I can only assume this is because parse-tika produces slightly better
results for this mimetype. Let me explain.

With the language identifier included and parse-plugins.xml overridden
to parse rss+xml solely with the feed plugin, I get:

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
fetching: http://feeds.feedburner.com/gov/GCC?format=xml
parsing: http://feeds.feedburner.com/gov/GCC?format=xml
contentType: application/rss+xml
content :       
host :  feeds.feedburner.com
tstamp :        Fri Jun 08 14:04:04 BST 2012
lang :  unknown
url :   http://feeds.feedburner.com/gov/GCC?format=xml

However, with parse-tika enabled and the same fetch, I get:

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
fetching: http://feeds.feedburner.com/gov/GCC?format=xml
parsing: http://feeds.feedburner.com/gov/GCC?format=xml
contentType: application/rss+xml
content :       Glasgow City Council - News Feed Glasgow City Council - News
Feed Keep up to date with all the news
title : Glasgow City Council - News Feed
host :  feeds.feedburner.com
tstamp :        Fri Jun 08 14:04:25 BST 2012
lang :  en
url :   http://feeds.feedburner.com/gov/GCC?format=xml
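
For anyone wanting to reproduce the two runs above: the only difference
is the plugin list for this mimetype in conf/parse-plugins.xml. Roughly
(this is a sketch from memory rather than a copy of the shipped file, so
please check the plugin ids against the <aliases> section of your own
copy):

```xml
<mimeType name="application/rss+xml">
  <!-- Stock order tries parse-tika first, as in the second run above.
       Removing the next line leaves feed as the only parser, which is
       the override used for the first run. -->
  <plugin id="parse-tika" />
  <plugin id="feed" />
</mimeType>
```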

Please note that this feed does not include info like publishedDate,
updatedDate etc., instead offering other means of expressing (some of)
this information. In the above case, as the parse data is not present
for the required feed fields (under either the feed plugin or, for
argument's sake, parse-tika), these fields are not included in our
subsequent index fields.

I hope this clears things up a bit.

On a side note, some things to pick up from the above excerpts from
some tests:
1) The feed plugin fails to recognize the content, title and lang
fields, whereas parse-tika does this successfully.
2) Even though parse-tika DOES utilise the language-identifier to
recognize the lang field and provide a value, it fails to include the
full value, which should be lang="en-GB" as opposed to lang="en".

Can anyone chime in on what the current state of affairs is with
delegation of language detection to parse-tika, or whether this is
already the case but needs to be patched to accommodate the scenario I
describe above?

Thanks

Lewis

On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
> Hi Lewis,
>
> My solrindex-mapping contains
> <mapping>
>        <!-- Simple mapping of fields created by Nutch IndexingFilters
>             to fields defined (and expected) in Solr schema.xml.
>
>             Any fields in NutchDocument that match a name defined
>             in field/@source will be renamed to the corresponding
>             field/@dest.
>             Additionally, if a field name (before mapping) matches
>             a copyField/@source then its values will be copied to
>             the corresponding copyField/@dest.
>
>             uniqueKey has the same meaning as in Solr schema.xml
>             and defaults to "id" if not defined.
>         -->
>        <fields>
>                <field dest="content" source="content"/>
>                <field dest="site" source="site"/>
>                <field dest="title" source="title"/>
>                <field dest="host" source="host"/>
>                <field dest="segment" source="segment"/>
>                <field dest="boost" source="boost"/>
>                <field dest="digest" source="digest"/>
>                <field dest="tstamp" source="tstamp"/>
>                <field dest="publishedDate" source="publishedDate"/>
>                <field dest="id" source="url"/>
>                <copyField source="url" dest="url"/>
>        </fields>
>        <uniqueKey>id</uniqueKey>
> </mapping>
>
>
> Do I need to edit any source code of the feed plugin to make this
> publishedDate available?
>
> Thanks
> Shameema
>
> On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
>> Best way to test this is by doing ad-hoc parsechecker fetches. Also
>> try including this value in your solr-mapping file.
>>
>> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>> In my schema there are certain fields used for feed plugin.
>>>
>>>        <!-- fields for feed plugin (tag is also used by
>>> microformats-reltag)-->
>>>        <field name="author" type="string" stored="true" indexed="true"/>
>>>        <field name="tag" type="string" stored="true" indexed="true"
>>> multiValued="true"/>
>>>        <field name="feed" type="string" stored="true" indexed="true"/>
>>>        <field name="publishedDate" type="date" stored="true"
>>>            indexed="true"/>
>>>        <field name="updatedDate" type="date" stored="true"
>>>            indexed="true"/>
>>>
>>> I have included the feed plugin in nutch-site.xml. The feed file is fetched
>>> and parsed, and the links in it are working properly. But I cannot get
>>> the publishedDate working.
>>> I cannot retrieve the publishedDate or sort by it.
>>>
>>> Please help.
>>
>>
>>
>> --
>> Lewis



-- 
Lewis
