Hi Lewis,

The feed you provided, http://feeds.feedburner.com/gov/GCC?format=xml, has the pubDate tag. Why is it not parsed, then? Please explain.
What I need is the value of pubDate pulled into any of our date fields.

Thanks
Shameema

On Wed, Jun 13, 2012 at 6:28 PM, Shameema Umer <[email protected]> wrote:

> I ran parsechecker to confirm that no value is retrieved for publishedDate.
>
>
> On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:
>
>> Hi,
>>
>> I have been trying for days to find a way to retrieve the <pubDate>
>> value of a feed. Even though the value is there in the feed, Nutch is
>> not parsing it and sending it along with the outlinks.
>>
>> The feed plugin is included, but it is not populating the
>> publishedDate field. Could somebody please give me hints on where I
>> went wrong?
>>
>> Or please let me know if it is not possible.
>>
>> Thanks
>> Shameema
>>
>>
>> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
>>
>>> Thanks Lewis.
>>>
>>>
>>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>
>>>> Hi Shameema,
>>>>
>>>> I think this depends directly on what tags/elements are within the
>>>> feed(s). From the feeds I looked at yesterday the relevant tags
>>>> appeared to be missing. I was surprised that Tika didn't pick up more,
>>>> so I think I'll head over and see exactly what the Tika 1.1 source
>>>> looks like for the rss+xml parser.
>>>>
>>>> In the meantime the feed plugin packaged with Nutch WILL parse and
>>>> index these additional fields if they are present, but will not if
>>>> they are absent.
>>>>
>>>> Lewis
>>>>
>>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
>>>> > Hi Lewis, things are clear now. I am upset that I cannot find a
>>>> > means to find the age of a web page with Nutch. I thought
>>>> > publishedDate from the feed plugin would help. If I change the
>>>> > field name from publishedDate to *pubDate*, will this help?
>>>> >
>>>> > Thanks
>>>> > Shameema
>>>> >
>>>> >
>>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>> >
>>>> >> Hi,
>>>> >>
>>>> >> No, this should not be necessary. The feed parser and accompanying
>>>> >> indexing filter should extract and send (to be indexed) the
>>>> >> following metadata items:
>>>> >> Author, Tags, Published date, Updated date and feed.
>>>> >>
>>>> >> There is a problem though...
>>>> >>
>>>> >> With many feeds, including the bbci one you provided in another
>>>> >> thread, many of these fields are absent, so the parser and indexing
>>>> >> filter cannot operate on our behalf and subsequently leave these
>>>> >> fields out.
>>>> >>
>>>> >> It is also important to note that in parse-plugins.xml we first try
>>>> >> to parse the application/rss+xml mimetype with parse-tika before
>>>> >> feed... I can only assume this is because parse-tika produces
>>>> >> slightly better results for this mimetype. Let me explain.
>>>> >>
>>>> >> With the language identifier included and parse-plugins overridden
>>>> >> to parse rss+xml solely with the feed plugin, I get
>>>> >>
>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >> contentType: application/rss+xml
>>>> >> content :
>>>> >> host : feeds.feedburner.com
>>>> >> tstamp : Fri Jun 08 14:04:04 BST 2012
>>>> >> lang : unknown
>>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >>
>>>> >> However, with parse-tika initiated and the same fetch, I get
>>>> >>
>>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >> contentType: application/rss+xml
>>>> >> content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
>>>> >> title : Glasgow City Council - News Feed
>>>> >> host : feeds.feedburner.com
>>>> >> tstamp : Fri Jun 08 14:04:25 BST 2012
>>>> >> lang : en
>>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
>>>> >>
>>>> >> Please note that this feed does not include info like
>>>> >> publishedDate, updatedDate etc., instead offering other means of
>>>> >> expressing (some of) this information. In the above case, as the
>>>> >> parse data is not present for the required feed fields (or, for
>>>> >> argument's sake, parse-tika), these fields are not included in our
>>>> >> subsequent index fields.
>>>> >>
>>>> >> I hope this clears things up a bit.
>>>> >>
>>>> >> On a side note, some things to pick up from the above excerpts from
>>>> >> some tests:
>>>> >> 1) The feed plugin fails to recognize the content, title and lang
>>>> >> fields, where parse-tika does this successfully.
>>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
>>>> >> recognize the lang field and provide a value, it fails to include
>>>> >> the full value, which should be lang="en-GB" as opposed to
>>>> >> lang="en".
>>>> >>
>>>> >> Can anyone chime in on what the current state of affairs is with
>>>> >> delegation of language detection to parse-tika, or whether this is
>>>> >> already the case but needs to be patched to accommodate the
>>>> >> scenario I provide above?
>>>> >>
>>>> >> Thanks
>>>> >>
>>>> >> Lewis
>>>> >>
>>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
>>>> >> > Hi Lewis,
>>>> >> >
>>>> >> > My solrindex-mapping contains
>>>> >> > <mapping>
>>>> >> >   <!-- Simple mapping of fields created by Nutch IndexingFilters
>>>> >> >        to fields defined (and expected) in Solr schema.xml.
>>>> >> >
>>>> >> >        Any fields in NutchDocument that match a name defined
>>>> >> >        in field/@source will be renamed to the corresponding
>>>> >> >        field/@dest.
>>>> >> >        Additionally, if a field name (before mapping) matches
>>>> >> >        a copyField/@source then its values will be copied to
>>>> >> >        the corresponding copyField/@dest.
>>>> >> >
>>>> >> >        uniqueKey has the same meaning as in Solr schema.xml
>>>> >> >        and defaults to "id" if not defined.
>>>> >> >   -->
>>>> >> >   <fields>
>>>> >> >     <field dest="content" source="content"/>
>>>> >> >     <field dest="site" source="site"/>
>>>> >> >     <field dest="title" source="title"/>
>>>> >> >     <field dest="host" source="host"/>
>>>> >> >     <field dest="segment" source="segment"/>
>>>> >> >     <field dest="boost" source="boost"/>
>>>> >> >     <field dest="digest" source="digest"/>
>>>> >> >     <field dest="tstamp" source="tstamp"/>
>>>> >> >     <field dest="publishedDate" source="publishedDate"/>
>>>> >> >     <field dest="id" source="url"/>
>>>> >> >     <copyField source="url" dest="url"/>
>>>> >> >   </fields>
>>>> >> >   <uniqueKey>id</uniqueKey>
>>>> >> > </mapping>
>>>> >> >
>>>> >> >
>>>> >> > Do I need to edit any source code of the feed plugin to make this
>>>> >> > publishedDate available?
>>>> >> >
>>>> >> > Thanks
>>>> >> > Shameema
>>>> >> >
>>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>>> >> >> The best way to test this is by doing ad-hoc parsechecker
>>>> >> >> fetches. Also try including this value in your solr-mapping file.
>>>> >> >>
>>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>>> >> >>> In my schema there are certain fields used for the feed plugin.
>>>> >> >>>
>>>> >> >>> <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
>>>> >> >>> <field name="author" type="string" stored="true" indexed="true"/>
>>>> >> >>> <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
>>>> >> >>> <field name="feed" type="string" stored="true" indexed="true"/>
>>>> >> >>> <field name="publishedDate" type="date" stored="true" indexed="true"/>
>>>> >> >>> <field name="updatedDate" type="date" stored="true" indexed="true"/>
>>>> >> >>>
>>>> >> >>> I have included the feed plugin in nutch-site.xml. The feed file
>>>> >> >>> is fetched and parsed, and the links in it are working properly.
>>>> >> >>> But I cannot get publishedDate working.
>>>> >> >>> I cannot retrieve publishedDate or sort by it.
>>>> >> >>>
>>>> >> >>> Please help.
>>>> >> >>
>>>> >> >>
>>>> >> >>
>>>> >> >> --
>>>> >> >> Lewis
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Lewis
>>>> >>
>>>>
>>>>
>>>>
>>>> --
>>>> Lewis
>>>>
>>>
>>>
>>
>
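The question running through this thread is whether a given feed really carries <pubDate> per item, and what that value would look like in a Solr date field. Below is a minimal sketch for checking this outside Nutch, assuming Python 3 and an RSS 2.0 feed. It is not how the Nutch feed plugin or parse-tika works internally. The sample feed, the example.com links, and the helper names to_solr_date and item_dates are all made up for illustration; to test a real feed, save it locally and pass its text to item_dates.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, made up for illustration; it is NOT the
# content of the feedburner URL discussed in this thread.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
      <pubDate>Wed, 13 Jun 2012 18:28:00 +0530</pubDate>
    </item>
    <item>
      <title>Item without a date</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def to_solr_date(rfc822_text):
    """Convert an RFC 822 date string to the UTC ISO 8601 form used by Solr date fields."""
    parsed = parsedate_to_datetime(rfc822_text.strip())
    if parsed.tzinfo is None:
        # Assumption: a date without usable zone information is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def item_dates(rss_text):
    """Return (link, solr_date_or_None) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        link = item.findtext("link")
        pub_date = item.findtext("pubDate")
        # Items with no <pubDate> are reported with None instead of being skipped.
        results.append((link, to_solr_date(pub_date) if pub_date else None))
    return results


for link, date in item_dates(SAMPLE_RSS):
    print(link, date)
```

An item that prints None has no <pubDate> to extract, which is the "field absent" case described in the thread. An item that prints a date shows the value is present in the feed, so the thing to look at next is which parser handled the document.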

