I ran parsechecker and confirmed that no value is retrieved for publishedDate.
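
One way to rule out the feed itself, before digging further into Nutch, is to
check whether its items actually carry a non-empty <pubDate> element. Below is
a minimal Python sketch using only the standard library. The sample feed, the
example.com links and the function name items_missing_pubdate are made up for
illustration; they are not part of Nutch or of the feeds discussed in this
thread, and the sketch assumes a plain RSS 2.0 feed without namespaces on
<item> or <pubDate>.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed for illustration only (not from this thread).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>With date</title>
      <link>http://example.com/a</link>
      <pubDate>Fri, 08 Jun 2012 14:04:04 GMT</pubDate>
    </item>
    <item>
      <title>Without date</title>
      <link>http://example.com/b</link>
    </item>
  </channel>
</rss>"""


def items_missing_pubdate(rss_text):
    """Return the links of RSS 2.0 items that lack a non-empty <pubDate>."""
    root = ET.fromstring(rss_text)
    missing = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        # Treat an absent element and an empty/whitespace-only one the same.
        if pub is None or not pub.strip():
            missing.append(item.findtext("link", default="(no link)"))
    return missing


if __name__ == "__main__":
    # Prints the links of items that have no usable publication date.
    print(items_missing_pubdate(SAMPLE_RSS))
```

If the list comes back empty for a real feed, the dates are present in the
source and the problem is more likely on the parsing or indexing side.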

On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:

> Hi,
>
> I have been trying for days to find a way to retrieve the <pubDate> value
> of a feed. Even though the value is present in the feed, Nutch is not
> parsing it or sending it along with the outlinks.
>
> The feed plugin is included, but it is not populating the publishedDate
> field. Could somebody please give me hints about where I went wrong?
>
> Or please let me know if it is not possible.
>
> Thanks
> Shameema
>
>
> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
>
>> Thanks Lewis.
>>
>>
>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <
>> [email protected]> wrote:
>>
>>> Hi Shameema,
>>>
>>> I think this depends directly on what tags/elements are within the
>>> feed(s). From the feeds I looked at yesterday the relevant tags
>>> appeared to be missing. I was surprised that Tika didn't pick up more
>>> so I think I'll head over and see exactly what the Tika 1.1 source
>>> looks like for the rss+xml parser.
>>>
>>> In the meantime the feed plugin packaged with Nutch WILL parse and
>>> index these additional fields if they are present, but will not if
>>> they are absent.
>>>
>>> Lewis
>>>
>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
>>> > Hi Lewis, things are clear now. I am upset that I cannot find a way to
>>> > determine the age of a web page with Nutch. I thought publishedDate
>>> > from the feed plugin would help. If I change the field name from
>>> > publishedDate to pubDate, will this help?
>>> >
>>> > Thanks
>>> > Shameema
>>> >
>>> >
>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
>>> > [email protected]> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> No, this should not be necessary. The feed parser and accompanying
>>> >> indexing filter should extract and send (to be indexed) the following
>>> >> metadata items:
>>> >> author, tags, published date, updated date and feed.
>>> >>
>>> >> There is a problem though...
>>> >>
>>> >> With many feeds, including the bbci one you provided in another
>>> >> thread, many of these fields are absent, so the parser and indexing
>>> >> filter have nothing to work with and subsequently leave these
>>> >> fields out.
>>> >>
>>> >> It is also important to note that in parse-plugins.xml we first try to
>>> >> parse the application/rss+xml mimetype with parse-tika before feed...
>>> >> I can only assume this is because parse-tika produces slightly better
>>> >> results for this mimetype. Let me explain.
>>> >>
>>> >> With the language identifier included and parse-plugins.xml overridden
>>> >> to parse rss+xml solely with the feed plugin, I get:
>>> >>
>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch
>>> >> indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> contentType: application/rss+xml
>>> >> content :
>>> >> host :  feeds.feedburner.com
>>> >> tstamp :        Fri Jun 08 14:04:04 BST 2012
>>> >> lang :  unknown
>>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>>> >>
>>> >> However, with parse-tika enabled and the same fetch, I get:
>>> >>
>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch
>>> >> indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> contentType: application/rss+xml
>>> >> content :       Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
>>> >> title : Glasgow City Council - News Feed
>>> >> host :  feeds.feedburner.com
>>> >> tstamp :        Fri Jun 08 14:04:25 BST 2012
>>> >> lang :  en
>>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
>>> >>
>>> >> Please note that this feed does not include info like publishedDate,
>>> >> updatedDate etc., instead offering other means of expressing (some of)
>>> >> this information. In the above case, as the parse data for the
>>> >> required feed fields is not present with the feed plugin or, for
>>> >> argument's sake, with parse-tika, these fields are not included in
>>> >> our subsequent index fields.
>>> >>
>>> >> I hope this clears things up a bit.
>>> >>
>>> >> On a side note, some things to pick up from the above excerpts from
>>> >> some tests:
>>> >> 1) The feed plugin fails to recognize the content, title and lang
>>> >> fields, where parse-tika does this successfully.
>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
>>> >> recognize the lang field and provide a value, it fails to include the
>>> >> full value, which should be lang="en-GB" as opposed to lang="en".
>>> >>
>>> >> Can anyone chime in on what the current state of affairs is with
>>> >> delegation of language detection to parse-tika, or whether this is
>>> >> already the case but needs to be patched to accommodate the scenario
>>> >> I describe above?
>>> >>
>>> >> Thanks
>>> >>
>>> >> Lewis
>>> >>
>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
>>> >> > Hi Lewis,
>>> >> >
>>> >> > My solrindex-mapping contains
>>> >> > <mapping>
>>> >> >        <!-- Simple mapping of fields created by Nutch
>>> >> >             IndexingFilters to fields defined (and expected) in
>>> >> >             Solr schema.xml.
>>> >> >
>>> >> >             Any fields in NutchDocument that match a name defined
>>> >> >             in field/@source will be renamed to the corresponding
>>> >> >             field/@dest.
>>> >> >             Additionally, if a field name (before mapping) matches
>>> >> >             a copyField/@source then its values will be copied to
>>> >> >             the corresponding copyField/@dest.
>>> >> >
>>> >> >             uniqueKey has the same meaning as in Solr schema.xml
>>> >> >             and defaults to "id" if not defined.
>>> >> >         -->
>>> >> >        <fields>
>>> >> >                <field dest="content" source="content"/>
>>> >> >                <field dest="site" source="site"/>
>>> >> >                <field dest="title" source="title"/>
>>> >> >                <field dest="host" source="host"/>
>>> >> >                <field dest="segment" source="segment"/>
>>> >> >                <field dest="boost" source="boost"/>
>>> >> >                <field dest="digest" source="digest"/>
>>> >> >                <field dest="tstamp" source="tstamp"/>
>>> >> >                <field dest="publishedDate" source="publishedDate"/>
>>> >> >                <field dest="id" source="url"/>
>>> >> >                <copyField source="url" dest="url"/>
>>> >> >        </fields>
>>> >> >        <uniqueKey>id</uniqueKey>
>>> >> > </mapping>
>>> >> >
>>> >> >
>>> >> > Do I need to edit any source code of the feed plugin to make
>>> >> > publishedDate available?
>>> >> >
>>> >> > Thanks
>>> >> > Shameema
>>> >> >
>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
>>> >> > <[email protected]> wrote:
>>> >> >> Best way to test this is by doing ad-hoc parsechecker fetches. Also
>>> >> >> try including this value in your solr-mapping file.
>>> >> >>
>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>> >> >>> In my schema there are certain fields used for feed plugin.
>>> >> >>>
>>> >> >>>        <!-- fields for feed plugin (tag is also used by
>>> >> >>>             microformats-reltag)-->
>>> >> >>>        <field name="author" type="string" stored="true"
>>> >> >>>            indexed="true"/>
>>> >> >>>        <field name="tag" type="string" stored="true"
>>> >> >>>            indexed="true" multiValued="true"/>
>>> >> >>>        <field name="feed" type="string" stored="true"
>>> >> >>>            indexed="true"/>
>>> >> >>>        <field name="publishedDate" type="date" stored="true"
>>> >> >>>            indexed="true"/>
>>> >> >>>        <field name="updatedDate" type="date" stored="true"
>>> >> >>>            indexed="true"/>
>>> >> >>>
>>> >> >>> I have included the feed plugin in nutch-site.xml. The feed file is
>>> >> >>> fetched and parsed, and the links in it are working properly. But I
>>> >> >>> cannot get publishedDate working: I cannot retrieve the field or
>>> >> >>> sort by it.
>>> >> >>>
>>> >> >>> Please help.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Lewis
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Lewis
>>> >>
>>>
>>>
>>>
>>> --
>>> Lewis
>>>
>>
>>
>
