Depending on what the tag looks like, the feed parser will interpret it accordingly. My instinct is that there is a difference between how pubDate and publishedDate are parsed and identified by the parser. That still leaves the question of how/why the field is not identified as a tag at all.
I will try to do more digging. It might be worth looking at the feed source as well.

Best
Lewis

On Thu, Jun 14, 2012 at 7:04 AM, Shameema Umer <[email protected]> wrote:

> Hi Lewis,
>
> The feed you provided http://feeds.feedburner.com/gov/GCC?format=xml has
> the pubDate tag.
> Then why is it not parsed? Please explain.
>
> What I need is the value of the pubDate
> pulled to any of our date fields.
>
> Thanks
> Shameema
>
> On Wed, Jun 13, 2012 at 6:28 PM, Shameema Umer <[email protected]> wrote:
>
>> I tried parsechecker to confirm that no value is retrieved for publishedDate.
>>
>> On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have been trying for days to find a way to retrieve the <pubDate> value of
>>> a feed. Even though the value is there in the feed, Nutch is not parsing it
>>> and sending it along with the outlinks.
>>>
>>> The feed plugin is included, but it is not populating a value in the field
>>> publishedDate. Somebody please give me hints where I went wrong.
>>>
>>> Or please let me know if it is not possible.
>>>
>>> Thanks
>>> Shameema
>>>
>>> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
>>>
>>>> Thanks Lewis.
>>>>
>>>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Shameema,
>>>>>
>>>>> I think this depends directly on what tags/elements are within the
>>>>> feed(s). From the feeds I looked at yesterday the relevant tags
>>>>> appeared to be missing. I was surprised that Tika didn't pick up more
>>>>> so I think I'll head over and see exactly what the Tika 1.1 source
>>>>> looks like for the rss+xml parser.
>>>>>
>>>>> In the meantime the feed plugin packaged with Nutch WILL parse and
>>>>> index these additional fields if they are present, but will not if
>>>>> they are absent.
>>>>>
>>>>> Lewis
>>>>>
>>>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
>>>>> > Hi Lewis, things are clear now. I am upset that I cannot find a means to
>>>>> > find the age of a web page with Nutch. I thought publishedDate from the
>>>>> > feed plugin would help. If I change the field name from publishedDate to
>>>>> > pubDate, will this help?
>>>>> >
>>>>> > Thanks
>>>>> > Shameema
>>>>> >
>>>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
>>>>> > [email protected]> wrote:
>>>>> >
>>>>> >> Hi,
>>>>> >>
>>>>> >> No, this should not be necessary. The feed parser and accompanying
>>>>> >> indexing filter should extract and send (to be indexed) the following
>>>>> >> metadata items:
>>>>> >> Author, Tags, Published date, Updated date and Feed.
>>>>> >>
>>>>> >> There is a problem though...
>>>>> >>
>>>>> >> With many feeds, including the bbci one you provided in another
>>>>> >> thread, many of these fields are absent, so the parser and indexing
>>>>> >> filter cannot operate on our behalf and subsequently leave these
>>>>> >> fields out.
>>>>> >>
>>>>> >> It is also important to note that in parse-plugins.xml we first try to
>>>>> >> parse the application/rss+xml mimetype with parse-tika before feed...
>>>>> >> I can only assume this is because parse-tika produces slightly better
>>>>> >> results for this mimetype.
>>>>> >> Let me explain.
>>>>> >>
>>>>> >> With language identifier included and parse-plugins overridden to
>>>>> >> parse rss+xml solely with the feed plugin I get
>>>>> >>
>>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> contentType: application/rss+xml
>>>>> >> content :
>>>>> >> host : feeds.feedburner.com
>>>>> >> tstamp : Fri Jun 08 14:04:04 BST 2012
>>>>> >> lang : unknown
>>>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >>
>>>>> >> however with parse-tika initiated and the same fetch I get
>>>>> >>
>>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >> contentType: application/rss+xml
>>>>> >> content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
>>>>> >> title : Glasgow City Council - News Feed
>>>>> >> host : feeds.feedburner.com
>>>>> >> tstamp : Fri Jun 08 14:04:25 BST 2012
>>>>> >> lang : en
>>>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
>>>>> >>
>>>>> >> Please note that this feed does not include info like publishedDate,
>>>>> >> updatedDate etc., instead offering other means of expressing (some) of
>>>>> >> this information. In the above case, as the parse data is not present
>>>>> >> for the required feed fields, or for argument's sake parse-tika, these
>>>>> >> fields are not included in our subsequent index fields.
>>>>> >>
>>>>> >> I hope this clears things up a bit.
>>>>> >>
>>>>> >> On a side note, there are also some things to pick up from the above
>>>>> >> excerpts from some tests:
>>>>> >> 1) The feed plugin fails to recognize the content, title and lang fields
>>>>> >> where parse-tika does this successfully.
>>>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
>>>>> >> recognize the lang field and provide a value, it fails to include the
>>>>> >> full value, which should be lang="en-GB" as opposed to lang="en".
>>>>> >>
>>>>> >> Can anyone chime in on what the current state of affairs is with
>>>>> >> delegation of language detection to parse-tika, or whether this is
>>>>> >> already the case but needs to be patched to accommodate the scenario I
>>>>> >> provide above?
>>>>> >>
>>>>> >> Thanks
>>>>> >>
>>>>> >> Lewis
>>>>> >>
>>>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
>>>>> >> > Hi Lewis,
>>>>> >> >
>>>>> >> > My solrindex-mapping contains
>>>>> >> > <mapping>
>>>>> >> >   <!-- Simple mapping of fields created by Nutch IndexingFilters
>>>>> >> >        to fields defined (and expected) in Solr schema.xml.
>>>>> >> >
>>>>> >> >        Any fields in NutchDocument that match a name defined
>>>>> >> >        in field/@source will be renamed to the corresponding
>>>>> >> >        field/@dest.
>>>>> >> >        Additionally, if a field name (before mapping) matches
>>>>> >> >        a copyField/@source then its values will be copied to
>>>>> >> >        the corresponding copyField/@dest.
>>>>> >> >
>>>>> >> >        uniqueKey has the same meaning as in Solr schema.xml
>>>>> >> >        and defaults to "id" if not defined.
>>>>> >> >   -->
>>>>> >> >   <fields>
>>>>> >> >     <field dest="content" source="content"/>
>>>>> >> >     <field dest="site" source="site"/>
>>>>> >> >     <field dest="title" source="title"/>
>>>>> >> >     <field dest="host" source="host"/>
>>>>> >> >     <field dest="segment" source="segment"/>
>>>>> >> >     <field dest="boost" source="boost"/>
>>>>> >> >     <field dest="digest" source="digest"/>
>>>>> >> >     <field dest="tstamp" source="tstamp"/>
>>>>> >> >     <field dest="publishedDate" source="publishedDate"/>
>>>>> >> >     <field dest="id" source="url"/>
>>>>> >> >     <copyField source="url" dest="url"/>
>>>>> >> >   </fields>
>>>>> >> >   <uniqueKey>id</uniqueKey>
>>>>> >> > </mapping>
>>>>> >> >
>>>>> >> > Do I need to edit any source code of the feed plugin to make this
>>>>> >> > publishedDate available?
>>>>> >> >
>>>>> >> > Thanks
>>>>> >> > Shameema
>>>>> >> >
>>>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
>>>>> >> > <[email protected]> wrote:
>>>>> >> >> Best way to test this is by doing ad-hoc parsechecker fetches. Also
>>>>> >> >> try including this value in your solr-mapping file.
>>>>> >> >>
>>>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>>>> >> >>> In my schema there are certain fields used for the feed plugin.
>>>>> >> >>>
>>>>> >> >>> <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
>>>>> >> >>> <field name="author" type="string" stored="true" indexed="true"/>
>>>>> >> >>> <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
>>>>> >> >>> <field name="feed" type="string" stored="true" indexed="true"/>
>>>>> >> >>> <field name="publishedDate" type="date" stored="true" indexed="true"/>
>>>>> >> >>> <field name="updatedDate" type="date" stored="true" indexed="true"/>
>>>>> >> >>>
>>>>> >> >>> I have included the feed plugin in nutch-site.xml.
>>>>> >> >>> The feed file is fetched and parsed, and the links in it are
>>>>> >> >>> working properly. But I cannot get the publishedDate working.
>>>>> >> >>> I cannot retrieve the publishedDate or sort by it.
>>>>> >> >>>
>>>>> >> >>> Please help.
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Lewis
>>>>> >>
>>>>> >> --
>>>>> >> Lewis
>>>>>
>>>>> --
>>>>> Lewis

--
Lewis
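
For anyone following along, here is a minimal sketch for checking, outside of Nutch, whether a feed's items actually carry a pubDate and what that value would look like as a Solr date. It is not the feed plugin's code. It uses only the Python standard library, the sample feed and the item_dates helper are made up for illustration, and it assumes plain RSS 2.0 with RFC 822 style dates that include a timezone offset.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed for illustration only; this is NOT the content
# of the Glasgow City Council feed discussed in the thread.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <pubDate>Thu, 14 Jun 2012 07:04:00 +0100</pubDate>
    </item>
    <item>
      <title>Item without a date</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def item_dates(rss_text):
    """Return (link, solr_date_or_None) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        link = item.findtext("link")
        raw = item.findtext("pubDate")
        if raw is None:
            # The tag is absent, so there is nothing to index for this item.
            results.append((link, None))
            continue
        # RSS 2.0 pubDate uses the RFC 822 date format. Assumption: the value
        # carries a timezone offset, so the parsed datetime is timezone-aware.
        dt = parsedate_to_datetime(raw.strip()).astimezone(timezone.utc)
        # Solr date fields expect UTC in the form YYYY-MM-DDThh:mm:ssZ.
        results.append((link, dt.strftime("%Y-%m-%dT%H:%M:%SZ")))
    return results


if __name__ == "__main__":
    for link, date in item_dates(SAMPLE_RSS):
        print(link, date)
```

Running item_dates against the saved source of the real feed should show fairly quickly whether the dates are missing from the feed itself or are being lost somewhere between parsing and indexing.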

