How the feed parser interprets the tag depends on what the tag looks
like in the feed.
My instinct is that there is a difference between how pubDate and
publishedDate are parsed and identified by the parser. However, the
question then arises as to how/why the field is not identified as a
tag.

I will try to do more digging. It might be worth looking at the feed
source as well.
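
One quick way to look at the feed source independently of Nutch is a small script along these lines. This is only a rough sketch using the Python standard library; it is not part of Nutch or the feed plugin. The sample feed below is made up, the function name item_pub_dates is hypothetical, and it assumes a plain RSS 2.0 document where <item> and <pubDate> carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Made-up RSS 2.0 sample: one item with a <pubDate>, one without.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <pubDate>Thu, 14 Jun 2012 07:04:00 GMT</pubDate>
    </item>
    <item>
      <title>Second item (no date)</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def item_pub_dates(rss_text):
    """Return a list of (link, datetime or None), one entry per <item>."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        link = item.findtext("link")
        raw = item.findtext("pubDate")
        # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
        parsed = parsedate_to_datetime(raw.strip()) if raw else None
        results.append((link, parsed))
    return results


if __name__ == "__main__":
    for link, published in item_pub_dates(SAMPLE_RSS):
        print(link, published if published else "no pubDate found")
```

If the real feed shows a date for every item here, the value is present in the source and the question is purely about what the parser does with it. To try it on a real feed, download the XML first and pass its text to item_pub_dates.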

Best
Lewis

On Thu, Jun 14, 2012 at 7:04 AM, Shameema Umer <[email protected]> wrote:
> Hi Lewis,
>
> The feed you provided http://feeds.feedburner.com/gov/GCC?format=xml has
> the pubDate tag.
> Then why is it not parsed? Please explain.
>
> What I need is the value of the pubDate
> pulled into any of our date fields.
>
> Thanks
> Shameema
>
>
>
> On Wed, Jun 13, 2012 at 6:28 PM, Shameema Umer <[email protected]> wrote:
>
>> I tried parsechecker and confirmed that no value is retrieved for publishedDate.
>>
>>
>> On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have been trying for days to find a way to retrieve the <pubDate> value of
>>> a feed. Even though the value is there in the feed, Nutch is not parsing it and
>>> sending it along with the outlinks.
>>>
>>> The feed plugin is included, but it is not populating a value in the
>>> publishedDate field. Could somebody please give me a hint as to where I went wrong?
>>>
>>> Or please let me know if it is not possible.
>>>
>>> Thanks
>>> Shameema
>>>
>>>
>>> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
>>>
>>>> Thanks Lewis.
>>>>
>>>>
>>>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Shameema,
>>>>>
>>>>> I think this depends directly on what tags/elements are within the
>>>>> feed(s). From the feeds I looked at yesterday the relevant tags
>>>>> appeared to be missing. I was surprised that Tika didn't pick up more
>>>>> so I think I'll head over and see exactly what the Tika 1.1 source
>>>>> looks like for the rss+xml parser.
>>>>>
>>>>> In the meantime the feed plugin packaged with Nutch WILL parse and
>>>>> index these additional fields if they are present, but will not if
>>>>> they are absent.
>>>>>
>>>>> Lewis
>>>>>
>>>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]>
>>>>> wrote:
>>>>> > Hi Lewis, things are clear now. I am upset that I cannot find a means to
>>>>> > find the age of a web page with Nutch. I thought publishedDate from the feed
>>>>> > plugin would help. If I change the field name from publishedDate to
>>>>> > *pubDate*, will this help?
>>>>> >
>>>>> > Thanks
>>>>> > Shameema
>>>>> >
>>>>> >
>>>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
>>>>> > [email protected]> wrote:
>>>>> >
>>>>> >> Hi,
>>>>> >>
>>>>> >> No, this should not be necessary. The feed parser and accompanying
>>>>> >> indexing filter should extract and send (to be indexed) the following
>>>>> >> metadata items:
>>>>> >> Author, Tags, Published date, Updated date and Feed.
>>>>> >>
>>>>> >> There is a problem though...
>>>>> >>
>>>>> >> With many feeds, including the bbci one you provided in another
>>>>> >> thread, many of these fields are absent, so the parser and indexing
>>>>> >> filter have nothing to operate on and subsequently leave these
>>>>> >> fields out.
>>>>> >>
>>>>> >> It is also important to note that in parse-plugins.xml we first try to
>>>>> >> parse the application/rss+xml mimetype with parse-tika before feed.
>>>>> >> I can only assume this is because parse-tika produces slightly better
>>>>> >> results for this mimetype. Let me explain.
>>>>> >>
>>>>> >> With the language identifier included and parse-plugins overridden to
>>>>> >> parse rss+xml solely with the feed plugin, I get:
>>>>> >>
>>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> contentType: application/rss+xml
>>>>> >> content :
>>>>> >> host :  feeds.feedburner.com
>>>>> >> tstamp :        Fri Jun 08 14:04:04 BST 2012
>>>>> >> lang :  unknown
>>>>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >>
>>>>> >> However, with parse-tika used instead and the same fetch, I get:
>>>>> >>
>>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> contentType: application/rss+xml
>>>>> >> content :       Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
>>>>> >> title : Glasgow City Council - News Feed
>>>>> >> host :  feeds.feedburner.com
>>>>> >> tstamp :        Fri Jun 08 14:04:25 BST 2012
>>>>> >> lang :  en
>>>>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >>
>>>>> >> Please note that this feed does not include info like publishedDate,
>>>>> >> updatedDate etc., instead offering other means of expressing (some of)
>>>>> >> this information. In the above case, as the parse data is not present
>>>>> >> for the required feed fields (or, for argument's sake, for parse-tika),
>>>>> >> these fields are not included in our subsequent index fields.
>>>>> >>
>>>>> >> I hope this clears things up a bit.
>>>>> >>
>>>>> >> On a side note, there are a couple of things to pick up from the above
>>>>> >> test excerpts:
>>>>> >> 1) The feed plugin fails to recognize the content, title and lang fields,
>>>>> >> where parse-tika does this successfully.
>>>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
>>>>> >> recognize the lang field and provide a value, it fails to include the
>>>>> >> full value, which should be lang="en-GB" as opposed to lang="en".
>>>>> >>
>>>>> >> Can anyone chime in on what the current state of affairs is with
>>>>> >> delegation of language detection to parse-tika, or whether this is
>>>>> >> already the case but needs to be patched to accommodate the scenario I
>>>>> >> provide above?
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Lewis
>>>>> >>
>>>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
>>>>> >> > Hi Lewis,
>>>>> >> >
>>>>> >> > My solrindex-mapping contains
>>>>> >> > <mapping>
>>>>> >> >        <!-- Simple mapping of fields created by Nutch IndexingFilters
>>>>> >> >             to fields defined (and expected) in Solr schema.xml.
>>>>> >> >
>>>>> >> >             Any fields in NutchDocument that match a name defined
>>>>> >> >             in field/@source will be renamed to the corresponding
>>>>> >> >             field/@dest.
>>>>> >> >             Additionally, if a field name (before mapping) matches
>>>>> >> >             a copyField/@source then its values will be copied to
>>>>> >> >             the corresponding copyField/@dest.
>>>>> >> >
>>>>> >> >             uniqueKey has the same meaning as in Solr schema.xml
>>>>> >> >             and defaults to "id" if not defined.
>>>>> >> >         -->
>>>>> >> >        <fields>
>>>>> >> >                <field dest="content" source="content"/>
>>>>> >> >                <field dest="site" source="site"/>
>>>>> >> >                <field dest="title" source="title"/>
>>>>> >> >                <field dest="host" source="host"/>
>>>>> >> >                <field dest="segment" source="segment"/>
>>>>> >> >                <field dest="boost" source="boost"/>
>>>>> >> >                <field dest="digest" source="digest"/>
>>>>> >> >                <field dest="tstamp" source="tstamp"/>
>>>>> >> >                <field dest="publishedDate" source="publishedDate"/>
>>>>> >> >                <field dest="id" source="url"/>
>>>>> >> >                <copyField source="url" dest="url"/>
>>>>> >> >        </fields>
>>>>> >> >        <uniqueKey>id</uniqueKey>
>>>>> >> > </mapping>
>>>>> >> >
>>>>> >> >
>>>>> >> > Do I need to edit any source code of the feed plugin to make
>>>>> >> > this publishedDate available?
>>>>> >> >
>>>>> >> > Thanks
>>>>> >> > Shameema
>>>>> >> >
>>>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
>>>>> >> > <[email protected]> wrote:
>>>>> >> >> The best way to test this is by doing ad-hoc parsechecker fetches.
>>>>> >> >> Also try including this value in your solr-mapping file.
>>>>> >> >>
>>>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>>>> >> >>> In my schema there are certain fields used for feed plugin.
>>>>> >> >>>
>>>>> >> >>>        <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
>>>>> >> >>>        <field name="author" type="string" stored="true" indexed="true"/>
>>>>> >> >>>        <field name="tag" type="string" stored="true" indexed="true"
>>>>> >> >>>            multiValued="true"/>
>>>>> >> >>>        <field name="feed" type="string" stored="true" indexed="true"/>
>>>>> >> >>>        <field name="publishedDate" type="date" stored="true"
>>>>> >> >>>            indexed="true"/>
>>>>> >> >>>        <field name="updatedDate" type="date" stored="true"
>>>>> >> >>>            indexed="true"/>
>>>>> >> >>>
>>>>> >> >>> I have included the feed plugin in nutch-site.xml. The feed file is
>>>>> >> >>> fetched and parsed, and the links in it are working properly, but I
>>>>> >> >>> cannot get publishedDate working: I cannot retrieve it or sort by it.
>>>>> >> >>>
>>>>> >> >>> Please help.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Lewis
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Lewis
>>>>> >>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lewis
>>>>>
>>>>
>>>>
>>>
>>



-- 
Lewis
