Hi Lewis,

Things are clear now, but I am frustrated that I cannot find a way to determine the age of a web page with Nutch. I thought publishedDate from the feed plugin would help. If I change the field name from publishedDate to *pubDate*, will this help?

Thanks
Shameema

On Fri, Jun 8, 2012 at 6:48 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> No, this should not be necessary. The feed parser and accompanying
> indexingfilter should extract and send (to be indexed) the following
> metadata items:
> Author, Tags, Published date, Updated date and feed.
>
> There is a problem though...
>
> With many feeds, including the bbci one you provided in another
> thread, many of these fields are absent, the parser and indexing
> filter cannot operate on our behalf and subsequently leave these
> fields out.
>
> It is also important to note that in parse-plugins.xml we first try to
> parse the application/rss+xml mimetype with parse-tika before feed...
> I can only assume this is because parse-tika produces slightly better
> results for this mimetype. Let me explain.
>
> With language identifier included and parse-plugins overridden to
> parse rss+xml solely with feed plugin I get
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> contentType: application/rss+xml
> content :
> host : feeds.feedburner.com
> tstamp : Fri Jun 08 14:04:04 BST 2012
> lang : unknown
> url : http://feeds.feedburner.com/gov/GCC?format=xml
>
> however with parse-tika initiated and the same fetch I get
>
> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
> fetching: http://feeds.feedburner.com/gov/GCC?format=xml
> parsing: http://feeds.feedburner.com/gov/GCC?format=xml
> contentType: application/rss+xml
> content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
> title : Glasgow City Council - News Feed
> host : feeds.feedburner.com
> tstamp : Fri Jun 08 14:04:25 BST 2012
> lang : en
> url : http://feeds.feedburner.com/gov/GCC?format=xml
>
> Please note that this feed does not include info like publishedDate,
> updatedDate etc., instead offering other means of expressing (some) of
> this information. In the above case, as the parse data is not present
> for the required feed fields, or for argument's sake parse-tika, these
> fields are not included in our subsequent index fields.
>
> I hope this clears things up a bit.
>
> On a side note, also some things to pick up from the above excerpts from
> some tests:
> 1) Feed plugin fails to recognize content, title and lang fields where
> parse-tika does this successfully.
> 2) Even though parse-tika DOES utilise the language-identifier to
> recognize the lang field and provide a value, it fails to include the
> full value, which should be lang="en-GB" as opposed to lang="en".
>
> Can anyone chime in on what the current state of affairs is with
> delegation of language detection to parse-tika, or whether this is
> already the case but needs patched to accommodate the scenario I
> provide above?
>
> Thanks
>
> Lewis
>
> On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
> > Hi Lewis,
> >
> > My solrindex-mapping contains
> > <mapping>
> >   <!-- Simple mapping of fields created by Nutch IndexingFilters
> >        to fields defined (and expected) in Solr schema.xml.
> >
> >        Any fields in NutchDocument that match a name defined
> >        in field/@source will be renamed to the corresponding
> >        field/@dest.
> >        Additionally, if a field name (before mapping) matches
> >        a copyField/@source then its values will be copied to
> >        the corresponding copyField/@dest.
> >
> >        uniqueKey has the same meaning as in Solr schema.xml
> >        and defaults to "id" if not defined.
> >   -->
> >   <fields>
> >     <field dest="content" source="content"/>
> >     <field dest="site" source="site"/>
> >     <field dest="title" source="title"/>
> >     <field dest="host" source="host"/>
> >     <field dest="segment" source="segment"/>
> >     <field dest="boost" source="boost"/>
> >     <field dest="digest" source="digest"/>
> >     <field dest="tstamp" source="tstamp"/>
> >     <field dest="publishedDate" source="publishedDate"/>
> >     <field dest="id" source="url"/>
> >     <copyField source="url" dest="url"/>
> >   </fields>
> >   <uniqueKey>id</uniqueKey>
> > </mapping>
> >
> >
> > Do I need to edit any source code of the feed plugin to make this
> > publishedDate available?
> >
> > Thanks
> > Shameema
> >
> > On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
> > <[email protected]> wrote:
> >> Best way to test this is by doing ad-hoc parsechecker fetches. Also
> >> try including this value in your solr-mapping file.
> >>
> >> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
> >>> In my schema there are certain fields used for feed plugin.
> >>>
> >>> <!-- fields for feed plugin (tag is also used by
> >>>      microformats-reltag)-->
> >>> <field name="author" type="string" stored="true" indexed="true"/>
> >>> <field name="tag" type="string" stored="true" indexed="true"
> >>>        multiValued="true"/>
> >>> <field name="feed" type="string" stored="true" indexed="true"/>
> >>> <field name="publishedDate" type="date" stored="true"
> >>>        indexed="true"/>
> >>> <field name="updatedDate" type="date" stored="true"
> >>>        indexed="true"/>
> >>>
> >>> I have included the feed plugin in nutch site xml. The feed file is fetched
> >>> and parsed, and the links in it are working properly. But I cannot get
> >>> the publishedDate working.
> >>> I cannot retrieve the publishedDate or sort by it.
> >>>
> >>> Please help.
> >>
> >>
> >>
> >> --
> >> Lewis
> >
>
> --
> Lewis
>
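
As noted in the quoted reply, the feed fields can only be indexed if the feed itself carries them. One way to check that before changing any mapping is to look at the raw feed directly. The following is a minimal sketch in Python, not part of Nutch and not how the feed plugin works internally. It assumes a plain, un-namespaced RSS 2.0 document in which each <item> may carry a <pubDate> element in the usual RFC 822 style. The function name item_pub_dates and the SAMPLE feed are made up for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def item_pub_dates(rss_xml):
    """Return a list of (title, datetime-or-None), one per <item>.

    Assumption: un-namespaced RSS 2.0. Atom or RSS 1.0 feeds use
    different (namespaced) elements and will simply yield no items.
    """
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        raw = item.findtext("pubDate")
        try:
            # pubDate uses RFC 822 dates, which email.utils can parse.
            when = parsedate_to_datetime(raw.strip()) if raw else None
        except (TypeError, ValueError):
            # Unparseable date text is treated the same as a missing date.
            when = None
        results.append((title, when))
    return results


# Hypothetical sample feed: one item with a date, one without.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example feed</title>
<item><title>With date</title><pubDate>Fri, 08 Jun 2012 14:04:04 +0100</pubDate></item>
<item><title>Without date</title></item>
</channel></rss>"""

if __name__ == "__main__":
    for title, when in item_pub_dates(SAMPLE):
        print(title, "->", when.isoformat() if when else "no pubDate")
```

If every item comes back without a date, renaming the field in the Solr mapping is unlikely to change anything, since there is nothing in the feed to extract.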

