I ran parsechecker and confirmed that no value is retrieved for publishedDate.
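
Since the feed plugin only fills publishedDate when the feed itself carries a date (as Lewis notes in the thread below), it can help to check the raw feed outside Nutch first. The following is a minimal Python sketch, not anything shipped with Nutch. The function name item_pub_dates and the embedded SAMPLE_RSS document are made up for illustration, and it only handles plain RSS 2.0 <item>/<pubDate> elements, so Atom or RDF feeds that use namespaces would need different element names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed, embedded only so the sketch runs offline.
# Replace it with the body of the real feed when checking an actual URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Dated item</title>
      <link>http://example.com/a</link>
      <pubDate>Fri, 08 Jun 2012 14:04:04 +0100</pubDate>
    </item>
    <item>
      <title>Undated item</title>
      <link>http://example.com/b</link>
    </item>
  </channel>
</rss>"""


def item_pub_dates(rss_text):
    """Return a (link, datetime or None) pair for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    results = []
    for item in root.iter("item"):
        link = item.findtext("link")
        raw = item.findtext("pubDate")
        # RSS 2.0 dates use the RFC 822 style that email.utils can parse.
        published = parsedate_to_datetime(raw.strip()) if raw else None
        results.append((link, published))
    return results


if __name__ == "__main__":
    for link, published in item_pub_dates(SAMPLE_RSS):
        print(link, published.isoformat() if published else "no pubDate")
```

If every item of the real feed comes back without a date, the missing publishedDate is a property of the feed rather than of the Nutch configuration.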
On Wed, Jun 13, 2012 at 6:22 PM, Shameema Umer <[email protected]> wrote:
> Hi,
>
> I have been trying for days to find a way to retrieve the <pubDate> value of a
> feed. Even though the value is there in the feed, nutch is not parsing it and
> sending it along with the outlinks.
>
> The feed plugin is included, but it is not populating a value in the field
> publishedDate. Somebody please give me hints where I went wrong.
>
> Or please let me know if it is not possible.
>
> Thanks
> Shameema
>
>
> On Sat, Jun 9, 2012 at 4:13 PM, Shameema Umer <[email protected]> wrote:
>
>> Thanks Lewis.
>>
>>
>> On Sat, Jun 9, 2012 at 1:34 PM, Lewis John Mcgibbney <
>> [email protected]> wrote:
>>
>>> Hi Shameema,
>>>
>>> I think this depends directly on what tags/elements are within the
>>> feed(s). From the feeds I looked at yesterday the relevant tags
>>> appeared to be missing. I was surprised that Tika didn't pick up more
>>> so I think I'll head over and see exactly what the Tika 1.1 source
>>> looks like for the rss+xml parser.
>>>
>>> In the meantime the feed plugin packaged with Nutch WILL parse and
>>> index these additional fields if they are present, but will not if
>>> they are absent.
>>>
>>> Lewis
>>>
>>> On Fri, Jun 8, 2012 at 6:32 PM, Shameema Umer <[email protected]> wrote:
>>> > Hi Lewis, things are clear now. I am upset that I cannot find a means to
>>> > find the age of a web page with nutch. I thought publishedDate from the feed
>>> > plugin would help. If I change the field name from publishedDate to
>>> > *pubDate*, will this help?
>>> >
>>> > Thanks
>>> > Shameema
>>> >
>>> >
>>> > On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <
>>> > [email protected]> wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> No, this should not be necessary. The feed parser and accompanying
>>> >> indexingfilter should extract and send (to be indexed) the following
>>> >> metadata items:
>>> >> Author, Tags, Published date, Updated date and feed.
>>> >>
>>> >> There is a problem though...
>>> >>
>>> >> With many feeds, including the bbci one you provided in another
>>> >> thread, many of these fields are absent, so the parser and indexing
>>> >> filter cannot operate on our behalf and subsequently leave these
>>> >> fields out.
>>> >>
>>> >> It is also important to note that in parse-plugins.xml we first try to
>>> >> parse the application/rss+xml mimetype with parse-tika before feed...
>>> >> I can only assume this is because parse-tika produces slightly better
>>> >> results for this mimetype. Let me explain.
>>> >>
>>> >> With language identifier included and parse-plugins overridden to
>>> >> parse rss+xml solely with feed plugin I get
>>> >>
>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> contentType: application/rss+xml
>>> >> content :
>>> >> host : feeds.feedburner.com
>>> >> tstamp : Fri Jun 08 14:04:04 BST 2012
>>> >> lang : unknown
>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
>>> >>
>>> >> however with parse-tika initiated and the same fetch I get
>>> >>
>>> >> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
>>> >> contentType: application/rss+xml
>>> >> content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
>>> >> title : Glasgow City Council - News Feed
>>> >> host : feeds.feedburner.com
>>> >> tstamp : Fri Jun 08 14:04:25 BST 2012
>>> >> lang : en
>>> >> url : http://feeds.feedburner.com/gov/GCC?format=xml
>>> >>
>>> >> Please note that this feed does not include info like publishedDate,
>>> >> updatedDate etc., instead offering other means of expressing (some) of
>>> >> this information. In the above case, as the parse data is not present
>>> >> for the required feed fields, or for argument's sake parse-tika, these
>>> >> fields are not included in our subsequent index fields.
>>> >>
>>> >> I hope this clears things up a bit.
>>> >>
>>> >> On a sidenote, also some things to pick up from the above excerpts from
>>> >> some tests:
>>> >> 1) Feed plugin fails to recognize content, title and lang fields where
>>> >> parse-tika does this successfully.
>>> >> 2) Even though parse-tika DOES utilise the language-identifier to
>>> >> recognize the lang field and provide a value, it fails to include the
>>> >> full value which should be lang="en-GB" as opposed to lang="en"
>>> >>
>>> >> Can anyone chime in on what the current state of affairs is with
>>> >> delegation of language detection to parse-tika, or whether this is
>>> >> already the case but needs patched to accommodate the scenario I
>>> >> provide above?
>>> >>
>>> >> Thanks
>>> >>
>>> >> Lewis
>>> >>
>>> >> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
>>> >> > Hi Lewis,
>>> >> >
>>> >> > My solrindex-mapping contains
>>> >> > <mapping>
>>> >> >   <!-- Simple mapping of fields created by Nutch IndexingFilters
>>> >> >        to fields defined (and expected) in Solr schema.xml.
>>> >> >
>>> >> >        Any fields in NutchDocument that match a name defined
>>> >> >        in field/@source will be renamed to the corresponding
>>> >> >        field/@dest.
>>> >> >        Additionally, if a field name (before mapping) matches
>>> >> >        a copyField/@source then its values will be copied to
>>> >> >        the corresponding copyField/@dest.
>>> >> >
>>> >> >        uniqueKey has the same meaning as in Solr schema.xml
>>> >> >        and defaults to "id" if not defined.
>>> >> >   -->
>>> >> >   <fields>
>>> >> >     <field dest="content" source="content"/>
>>> >> >     <field dest="site" source="site"/>
>>> >> >     <field dest="title" source="title"/>
>>> >> >     <field dest="host" source="host"/>
>>> >> >     <field dest="segment" source="segment"/>
>>> >> >     <field dest="boost" source="boost"/>
>>> >> >     <field dest="digest" source="digest"/>
>>> >> >     <field dest="tstamp" source="tstamp"/>
>>> >> >     <field dest="publishedDate" source="publishedDate"/>
>>> >> >     <field dest="id" source="url"/>
>>> >> >     <copyField source="url" dest="url"/>
>>> >> >   </fields>
>>> >> >   <uniqueKey>id</uniqueKey>
>>> >> > </mapping>
>>> >> >
>>> >> >
>>> >> > Do I need to edit any source code of the feed plugin to make
>>> >> > this publishedDate available?
>>> >> >
>>> >> > Thanks
>>> >> > Shameema
>>> >> >
>>> >> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
>>> >> > <[email protected]> wrote:
>>> >> >> Best way to test this is by doing ad-hoc parsechecker fetches. Also
>>> >> >> try including this value in your solr-mapping file.
>>> >> >>
>>> >> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>> >> >>> In my schema there are certain fields used for the feed plugin.
>>> >> >>>
>>> >> >>> <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
>>> >> >>> <field name="author" type="string" stored="true" indexed="true"/>
>>> >> >>> <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
>>> >> >>> <field name="feed" type="string" stored="true" indexed="true"/>
>>> >> >>> <field name="publishedDate" type="date" stored="true" indexed="true"/>
>>> >> >>> <field name="updatedDate" type="date" stored="true" indexed="true"/>
>>> >> >>>
>>> >> >>> I have included the feed plugin in nutch site xml. The feed file is fetched
>>> >> >>> and parsed, and the links in it are working properly. But I cannot get
>>> >> >>> the publishedDate working.
>>> >> >>> I cannot retrieve the publishedDate or sort by it.
>>> >> >>>
>>> >> >>> Please help.
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Lewis
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Lewis
>>> >>
>>>
>>>
>>>
>>> --
>>> Lewis
>>>
>>
>>
>

