Hi,

I have been trying for days to find a way to retrieve the <pubDate> value
of a feed. Even though the value is present in the feed, Nutch is not
parsing it or sending it along with the outlinks.

The feed plugin is included, but it is not populating the publishedDate
field. Could somebody please give me a hint about where I went wrong?

Or please let me know if this is not possible.

Thanks
Shameema

On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:

> Thanks Lewis.
>
>
> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Shameema,
>>
>> I think this depends directly on what tags/elements are within the
>> feed(s). From the feeds I looked at yesterday the relevant tags
>> appeared to be missing. I was surprised that Tika didn't pick up more
>> so I think I'll head over and see exactly what the Tika 1.1 source
>> looks like for the rss+xml parser.
>>
>> In the meantime the feed plugin packaged with Nutch WILL parse and
>> index these additional fields if they are present, but will not if
>> they are absent.
>>
>> Lewis
>>
>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
>> > Hi Lewis, things are clear now. I am upset that I cannot find a way to
>> > determine the age of a web page with Nutch. I thought publishedDate from
>> > the feed plugin would help. If I change the field name from publishedDate
>> > to pubDate, will this help?
>> >
>> > Thanks
>> > Shameema
>> >
>> >
>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> No, this should not be necessary. The feed parser and accompanying
>> >> indexing filter should extract and send (to be indexed) the following
>> >> metadata items:
>> >> Author, Tags, Published date, Updated date and Feed.
>> >>
>> >> There is a problem though...
>> >>
>> >> With many feeds, including the bbci one you provided in another
>> >> thread, many of these fields are absent, so the parser and indexing
>> >> filter have nothing to operate on and subsequently leave these
>> >> fields out.
>> >>
>> >> It is also important to note that in parse-plugins.xml we first try to
>> >> parse the application/rss+xml mimetype with parse-tika before feed...
>> >> I can only assume this is because parse-tika produces slightly better
>> >> results for this mimetype. Let me explain:
>> >>
>> >> With language identifier included and parse-plugins overridden to
>> >> parse rss+xml solely with feed plugin I get
>> >>
>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch
>> >> indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>> >> contentType: application/rss+xml
>> >> content :
>> >> host :  feeds.feedburner.com
>> >> tstamp :        Fri Jun 08 14:04:04 BST 2012
>> >> lang :  unknown
>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>> >>
>> >> however with parse-tika initiated and the same fetch I get
>> >>
>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch
>> >> indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>> >> contentType: application/rss+xml
>> >> content :       Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
>> >> title : Glasgow City Council - News Feed
>> >> host :  feeds.feedburner.com
>> >> tstamp :        Fri Jun 08 14:04:25 BST 2012
>> >> lang :  en
>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>> >>
>> >> Please note that this feed does not include fields such as
>> >> publishedDate or updatedDate, instead offering other means of
>> >> expressing some of this information. In the above case, because the
>> >> parse data for the required feed fields is not present (with the feed
>> >> plugin or, for argument's sake, with parse-tika), these fields are not
>> >> included in the resulting index fields.
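>> >>
>> >> For reference, the override described above (parsing rss+xml solely
>> >> with the feed plugin) would look roughly like the fragment below in
>> >> conf/parse-plugins.xml. This is only a sketch: it assumes the stock
>> >> file layout and that the "feed" alias is already declared in the
>> >> <aliases> section of that file.
>> >>
>> >> ```xml
>> >> <!-- Sketch only: route application/rss+xml to the feed plugin alone.
>> >>      The stock entry tries parse-tika first and feed second. -->
>> >> <mimeType name="application/rss+xml">
>> >>   <plugin id="feed" />
>> >> </mimeType>
>> >> ```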
>> >>
>> >> I hope this clears things up a bit.
>> >>
>> >> On a side note, a few things to pick up from the above excerpts from
>> >> some tests:
>> >> 1) The feed plugin fails to recognize the content, title and lang
>> >> fields, whereas parse-tika does this successfully.
>> >> 2) Even though parse-tika DOES utilise the language-identifier to
>> >> recognize the lang field and provide a value, it fails to include the
>> >> full value, which should be lang="en-GB" as opposed to lang="en".
>> >>
>> >> Can anyone chime in on what the current state of affairs is with
>> >> delegation of language detection to parse-tika, or whether this is
>> >> already the case but needs to be patched to accommodate the scenario I
>> >> describe above?
>> >>
>> >> Thanks
>> >>
>> >> Lewis
>> >>
>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]>
>> wrote:
>> >> > Hi Lewis,
>> >> >
>> >> > My solrindex-mapping contains
>> >> > <mapping>
>> >> >        <!-- Simple mapping of fields created by Nutch IndexingFilters
>> >> >             to fields defined (and expected) in Solr schema.xml.
>> >> >
>> >> >             Any fields in NutchDocument that match a name defined
>> >> >             in field/@source will be renamed to the corresponding
>> >> >             field/@dest.
>> >> >             Additionally, if a field name (before mapping) matches
>> >> >             a copyField/@source then its values will be copied to
>> >> >             the corresponding copyField/@dest.
>> >> >
>> >> >             uniqueKey has the same meaning as in Solr schema.xml
>> >> >             and defaults to "id" if not defined.
>> >> >         -->
>> >> >        <fields>
>> >> >                <field dest="content" source="content"/>
>> >> >                <field dest="site" source="site"/>
>> >> >                <field dest="title" source="title"/>
>> >> >                <field dest="host" source="host"/>
>> >> >                <field dest="segment" source="segment"/>
>> >> >                <field dest="boost" source="boost"/>
>> >> >                <field dest="digest" source="digest"/>
>> >> >                <field dest="tstamp" source="tstamp"/>
>> >> >                <field dest="publishedDate" source="publishedDate"/>
>> >> >                <field dest="id" source="url"/>
>> >> >                <copyField source="url" dest="url"/>
>> >> >        </fields>
>> >> >        <uniqueKey>id</uniqueKey>
>> >> > </mapping>
>> >> >
>> >> >
>> >> > Do I need to edit any source code of the feed plugin to make
>> >> > publishedDate available?
>> >> >
>> >> > Thanks
>> >> > Shameema
>> >> >
>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
>> >> > <[email protected]> wrote:
>> >> >> The best way to test this is by doing ad-hoc parsechecker fetches.
>> >> >> Also try including this value in your solr-mapping file.
>> >> >>
>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]>
>> >> wrote:
>> >> >>> In my schema there are certain fields used for the feed plugin.
>> >> >>>
>> >> >>>        <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
>> >> >>>        <field name="author" type="string" stored="true" indexed="true"/>
>> >> >>>        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
>> >> >>>        <field name="feed" type="string" stored="true" indexed="true"/>
>> >> >>>        <field name="publishedDate" type="date" stored="true" indexed="true"/>
>> >> >>>        <field name="updatedDate" type="date" stored="true" indexed="true"/>
>> >> >>>
>> >> >>> I have included the feed plugin in nutch-site.xml. The feed file is
>> >> >>> fetched and parsed, and the links in it are working properly. But I
>> >> >>> cannot get publishedDate working: I cannot retrieve it or sort by it.
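>> >> >>>
>> >> >>> For reference, including the plugin means listing its id in the
>> >> >>> plugin.includes property of nutch-site.xml. A minimal sketch: only
>> >> >>> the "feed" entry is the point here, and the other plugin ids in the
>> >> >>> value are illustrative assumptions, not the exact configuration
>> >> >>> used in this setup.
>> >> >>>
>> >> >>> ```xml
>> >> >>> <property>
>> >> >>>   <name>plugin.includes</name>
>> >> >>>   <!-- "feed" is the relevant entry; the remaining ids are assumed -->
>> >> >>>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> >> >>> </property>
>> >> >>> ```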
>> >> >>>
>> >> >>> Please help.
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Lewis
>> >>
>> >>
>> >>
>> >> --
>> >> Lewis
>> >>
>>
>>
>>
>> --
>> Lewis
>>
>
>
