Yes, please. Could you explore why the pubDate tag is not parsed and indexed?

Thanks
Shameema

On Thu, Jun 14, 2012 at 6:11 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Depending on what the tag looks like, it will be interpreted
> accordingly by the feed parser.
> My instinct is that there is a difference between pubDate and
> publishedDate being parsed and identified by the parser. However, the
> question then arises as to how/why the field is not identified as a
> tag.
>
> I will try to do more digging.. it might be worth looking at the feed
> source as well.
>
>  Best
> Lewis
>
> On Thu, Jun 14, 2012 at 7:04 AM, Shameema Umer <[email protected]> wrote:
> > Hi Lewis,
> >
> > The feed you provided http://feeds.feedburner.com/gov/GCC?format=xml has
> > the pubDate tag.
> > Then why is it not parsed? Please explain.
> >
> > What I need is the value of pubDate
> > pulled into any of our date fields.
> >
> > Thanks
> > Shameema
> >
> >
> >
> > On Wed, Jun 13, 2012 at 6:28 PM, Shameema Umer <[email protected]> wrote:
> >
> >> I tried parsechecker and confirmed that no value is retrieved for publishedDate.
> >>
> >>
> >> On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:
> >>
> >>> Hi,
> >>>
> >>> I have been trying for days to find a way to retrieve the <pubDate> value
> >>> of a feed. Even though the value is present in the feed, Nutch is not
> >>> parsing it and sending it along with the outlinks.
> >>>
> >>> The feed plugin is included, but it is not populating a value in the
> >>> publishedDate field. Could somebody please give me hints about where I
> >>> went wrong?
> >>>
> >>> Or please let me know if it is not possible.
> >>>
> >>> Thanks
> >>> Shameema
> >>>
> >>>
> >>> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
> >>>
> >>>> Thanks Lewis.
> >>>>
> >>>>
> >>>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <
> >>>> [email protected]> wrote:
> >>>>
> >>>>> Hi Shameema,
> >>>>>
> >>>>> I think this depends directly on what tags/elements are within the
> >>>>> feed(s). From the feeds I looked at yesterday the relevant tags
> >>>>> appeared to be missing. I was surprised that Tika didn't pick up more
> >>>>> so I think I'll head over and see exactly what the Tika 1.1 source
> >>>>> looks like for the rss+xml parser.
> >>>>>
> >>>>> In the meantime the feed plugin packaged with Nutch WILL parse and
> >>>>> index these additional fields if they are present, but will not if
> >>>>> they are absent.
> >>>>>
> >>>>> Lewis
> >>>>>
> >>>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]>
> >>>>> wrote:
> >>>>> > Hi Lewis, things are clear now. I am upset that I cannot find a means
> >>>>> > to find the age of a web page with Nutch. I thought publishedDate from
> >>>>> > the feed plugin would help. If I change the field name from
> >>>>> > publishedDate to *pubDate*, will this help?
> >>>>> >
> >>>>> > Thanks
> >>>>> > Shameema
> >>>>> >
> >>>>> >
> >>>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
> >>>>> > [email protected]> wrote:
> >>>>> >
> >>>>> >> Hi,
> >>>>> >>
> >>>>> >> No, this should not be necessary. The feed parser and accompanying
> >>>>> >> indexing filter should extract and send (to be indexed) the following
> >>>>> >> metadata items:
> >>>>> >> Author, Tags, Published date, Updated date and Feed.
> >>>>> >>
> >>>>> >> There is a problem though...
> >>>>> >>
> >>>>> >> With many feeds, including the bbci one you provided in another
> >>>>> >> thread, many of these fields are absent, so the parser and indexing
> >>>>> >> filter have nothing to work with and subsequently leave these
> >>>>> >> fields out.
> >>>>> >>
> >>>>> >> It is also important to note that in parse-plugins.xml we first try to
> >>>>> >> parse the application/rss+xml mimetype with parse-tika before feed.
> >>>>> >> I can only assume this is because parse-tika produces slightly better
> >>>>> >> results for this mimetype. Let me explain.
> >>>>> >>
> >>>>> >> With language identifier included and parse-plugins overridden to
> >>>>> >> parse rss+xml solely with feed plugin I get
> >>>>> >>
> >>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> contentType: application/rss+xml
> >>>>> >> content :
> >>>>> >> host :  feeds.feedburner.com
> >>>>> >> tstamp :        Fri Jun 08 14:04:04 BST 2012
> >>>>> >> lang :  unknown
> >>>>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >>
> >>>>> >> however with parse-tika initiated and the same fetch I get
> >>>>> >>
> >>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >> contentType: application/rss+xml
> >>>>> >> content :       Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
> >>>>> >> title : Glasgow City Council - News Feed
> >>>>> >> host :  feeds.feedburner.com
> >>>>> >> tstamp :        Fri Jun 08 14:04:25 BST 2012
> >>>>> >> lang :  en
> >>>>> >> url :   http://feeds.feedburner.com/gov/GCC?format=xml
> >>>>> >>
> >>>>> >> Please note that this feed does not include info like publishedDate,
> >>>>> >> updatedDate etc., instead offering other means of expressing (some) of
> >>>>> >> this information. In the above case, because no parse data is present
> >>>>> >> for the required feed fields (with the feed plugin or, for argument's
> >>>>> >> sake, with parse-tika), these fields are not included in our
> >>>>> >> subsequent index fields.
> >>>>> >>
> >>>>> >> I hope this clears things up a bit.
> >>>>> >>
> >>>>> >> On a side note, some things to pick up from the above excerpts from
> >>>>> >> some tests:
> >>>>> >> 1) The feed plugin fails to recognize the content, title and lang
> >>>>> >> fields, where parse-tika does this successfully.
> >>>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
> >>>>> >> recognize the lang field and provide a value, it fails to include the
> >>>>> >> full value, which should be lang="en-GB" as opposed to lang="en".
> >>>>> >>
> >>>>> >> Can anyone chime in on what the current state of affairs is with
> >>>>> >> delegation of language detection to parse-tika, or whether this is
> >>>>> >> already the case but needs to be patched to accommodate the scenario I
> >>>>> >> provide above?
> >>>>> >>
> >>>>> >> Thanks
> >>>>> >>
> >>>>> >> Lewis
> >>>>> >>
> >>>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
> >>>>> >> > Hi Lewis,
> >>>>> >> >
> >>>>> >> > My solrindex-mapping contains
> >>>>> >> > <mapping>
> >>>>> >> >        <!-- Simple mapping of fields created by Nutch IndexingFilters
> >>>>> >> >             to fields defined (and expected) in Solr schema.xml.
> >>>>> >> >
> >>>>> >> >             Any fields in NutchDocument that match a name defined
> >>>>> >> >             in field/@source will be renamed to the corresponding
> >>>>> >> >             field/@dest.
> >>>>> >> >             Additionally, if a field name (before mapping) matches
> >>>>> >> >             a copyField/@source then its values will be copied to
> >>>>> >> >             the corresponding copyField/@dest.
> >>>>> >> >
> >>>>> >> >             uniqueKey has the same meaning as in Solr schema.xml
> >>>>> >> >             and defaults to "id" if not defined.
> >>>>> >> >         -->
> >>>>> >> >        <fields>
> >>>>> >> >                <field dest="content" source="content"/>
> >>>>> >> >                <field dest="site" source="site"/>
> >>>>> >> >                <field dest="title" source="title"/>
> >>>>> >> >                <field dest="host" source="host"/>
> >>>>> >> >                <field dest="segment" source="segment"/>
> >>>>> >> >                <field dest="boost" source="boost"/>
> >>>>> >> >                <field dest="digest" source="digest"/>
> >>>>> >> >                <field dest="tstamp" source="tstamp"/>
> >>>>> >> >                <field dest="publishedDate" source="publishedDate"/>
> >>>>> >> >                <field dest="id" source="url"/>
> >>>>> >> >                <copyField source="url" dest="url"/>
> >>>>> >> >        </fields>
> >>>>> >> >        <uniqueKey>id</uniqueKey>
> >>>>> >> > </mapping>
> >>>>> >> >
> >>>>> >> >
> >>>>> >> > Do I need to edit any source code of the feed plugin to make
> >>>>> >> > publishedDate available?
> >>>>> >> >
> >>>>> >> > Thanks
> >>>>> >> > Shameema
> >>>>> >> >
> >>>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
> >>>>> >> > <[email protected]> wrote:
> >>>>> >> >> The best way to test this is by doing ad-hoc parsechecker fetches. Also,
> >>>>> >> >> try including this value in your solr-mapping file.
> >>>>> >> >>
> >>>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
> >>>>> >> >>> In my schema there are certain fields used for feed plugin.
> >>>>> >> >>>
> >>>>> >> >>>        <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
> >>>>> >> >>>        <field name="author" type="string" stored="true" indexed="true"/>
> >>>>> >> >>>        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
> >>>>> >> >>>        <field name="feed" type="string" stored="true" indexed="true"/>
> >>>>> >> >>>        <field name="publishedDate" type="date" stored="true" indexed="true"/>
> >>>>> >> >>>        <field name="updatedDate" type="date" stored="true" indexed="true"/>
> >>>>> >> >>>
> >>>>> >> >>> I have included the feed plugin in nutch-site.xml. The feed file is
> >>>>> >> >>> fetched and parsed, and the links in it are working properly. But I
> >>>>> >> >>> cannot get publishedDate working: I cannot retrieve it or sort by it.
> >>>>> >> >>>
> >>>>> >> >>> Please help.
> >>>>> >> >>
> >>>>> >> >>
> >>>>> >> >>
> >>>>> >> >> --
> >>>>> >> >> Lewis
> >>>>> >>
> >>>>> >>
> >>>>> >>
> >>>>> >> --
> >>>>> >> Lewis
> >>>>> >>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Lewis
> >>>>>
> >>>>
> >>>>
> >>>
> >>
>
>
>
> --
> Lewis
>
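
For anyone following this thread who wants to confirm, outside Nutch, whether a feed's items carry a pubDate and what that value would look like in a Solr date field, here is a minimal Python sketch. It is not part of Nutch or of the feed plugin, and it does not explain why the plugin leaves publishedDate empty. It assumes an RSS 2.0 feed whose pubDate uses the RFC 822 style, and a Solr date field that expects UTC in the form YYYY-MM-DDThh:mm:ssZ. The sample feed, the example.com links and the function names to_solr_date and item_dates are made up for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed (an assumption for illustration, not the real
# Glasgow City Council feed discussed in this thread).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <pubDate>Fri, 08 Jun 2012 14:04:04 +0100</pubDate>
    </item>
    <item>
      <title>Item without a date</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def to_solr_date(rfc822_value):
    """Convert an RFC 822 style date string to a UTC ISO 8601 string."""
    dt = parsedate_to_datetime(rfc822_value)
    if dt.tzinfo is None:
        # No usable offset in the input: treat the value as UTC (an assumption).
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def item_dates(rss_text):
    """Return (link, solr_date_or_None) for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        link = item.findtext("link")
        raw = item.findtext("pubDate")  # None when the element is absent
        results.append((link, to_solr_date(raw.strip()) if raw else None))
    return results


if __name__ == "__main__":
    for link, date in item_dates(SAMPLE_RSS):
        print(link, date)
```

An item that prints None has no pubDate element at all, which corresponds to the "fields are absent" case described earlier in the thread. If a date is printed here but publishedDate is still empty in parsechecker, the question moves to which parser handled the feed.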
