Hi,

No, this should not be necessary. The feed parser and its accompanying indexing filter should extract and send (to be indexed) the following metadata items: author, tags, published date, updated date and feed.
There is a problem though. With many feeds, including the bbci one you provided in another thread, many of these fields are absent, so the parser and indexing filter have nothing to extract on our behalf and these fields are left out. It is also important to note that in parse-plugins.xml we first try to parse the application/rss+xml mimetype with parse-tika before feed. I can only assume this is because parse-tika produces slightly better results for this mimetype. Let me explain.

With the language identifier included and parse-plugins overridden to parse rss+xml solely with the feed plugin I get

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
fetching: http://feeds.feedburner.com/gov/GCC?format=xml
parsing: http://feeds.feedburner.com/gov/GCC?format=xml
contentType: application/rss+xml
content :
host : feeds.feedburner.com
tstamp : Fri Jun 08 14:04:04 BST 2012
lang : unknown
url : http://feeds.feedburner.com/gov/GCC?format=xml

however, with parse-tika in place and the same fetch, I get

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ bin/nutch indexchecker http://feeds.feedburner.com/gov/GCC?format=xml
fetching: http://feeds.feedburner.com/gov/GCC?format=xml
parsing: http://feeds.feedburner.com/gov/GCC?format=xml
contentType: application/rss+xml
content : Glasgow City Council - News Feed Glasgow City Council - News Feed Keep up to date with all the news
title : Glasgow City Council - News Feed
host : feeds.feedburner.com
tstamp : Fri Jun 08 14:04:25 BST 2012
lang : en
url : http://feeds.feedburner.com/gov/GCC?format=xml

Please note that this feed does not include info like publishedDate, updatedDate etc., instead offering other means of expressing some of this information. In the above case, the parse data for the required feed fields is not present whether we parse with feed or, for argument's sake, with parse-tika, so these fields are not included in our subsequent index fields.
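For reference, the parse-plugins override mentioned above can be sketched roughly as follows. This is only a minimal fragment, assuming the usual parse-plugins.xml layout in which each mimeType element lists plugin ids in the order they are tried. It also assumes that the aliases section of the same file already maps the feed id to its parser extension, and that the feed plugin is enabled in plugin.includes.

```xml
<!-- Sketch only: a fragment of conf/parse-plugins.xml, not a complete file. -->
<!-- Assumes the <aliases> section already defines an alias for "feed". -->
<mimeType name="application/rss+xml">
  <!-- parse-tika is left out here so that only the feed plugin handles rss+xml -->
  <plugin id="feed" />
</mimeType>
```

After a change like this, re-running the same bin/nutch indexchecker command against the feed URL is a quick way to compare which fields each parser yields.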
I hope this clears things up a bit.

On a side note, there are also some things to pick up from the above excerpts from some tests:

1) The feed plugin fails to recognize the content, title and lang fields, whereas parse-tika does this successfully.
2) Even though parse-tika DOES utilise the language-identifier to recognize the lang field and provide a value, it fails to include the full value, which should be lang="en-GB" as opposed to lang="en".

Can anyone chime in on what the current state of affairs is with delegation of language detection to parse-tika, or whether this is already the case but needs to be patched to accommodate the scenario I provide above?

Thanks
Lewis

On Fri, Jun 8, 2012 at 5:07 AM, Shameema Umer <[email protected]> wrote:
> Hi Lewis,
>
> My solrindex-mapping contains
> <mapping>
> <!-- Simple mapping of fields created by Nutch IndexingFilters
> to fields defined (and expected) in Solr schema.xml.
>
> Any fields in NutchDocument that match a name defined
> in field/@source will be renamed to the corresponding
> field/@dest.
> Additionally, if a field name (before mapping) matches
> a copyField/@source then its values will be copied to
> the corresponding copyField/@dest.
>
> uniqueKey has the same meaning as in Solr schema.xml
> and defaults to "id" if not defined.
> -->
> <fields>
> <field dest="content" source="content"/>
> <field dest="site" source="site"/>
> <field dest="title" source="title"/>
> <field dest="host" source="host"/>
> <field dest="segment" source="segment"/>
> <field dest="boost" source="boost"/>
> <field dest="digest" source="digest"/>
> <field dest="tstamp" source="tstamp"/>
> <field dest="publishedDate" source="publishedDate"/>
> <field dest="id" source="url"/>
> <copyField source="url" dest="url"/>
> </fields>
> <uniqueKey>id</uniqueKey>
> </mapping>
>
>
> Do I need to edit any source code of feed plugin to make available
> this publishedDate.
>
> Thanks
> Shameema
>
> On Thu, Jun 7, 2012 at 4:32 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
>> Best way to test this is by doing ad-hoc parsechecker fetches. Also
>> try including this value in your solr-mapping file.
>>
>> On Thu, Jun 7, 2012 at 11:41 AM, Shameema Umer <[email protected]> wrote:
>>> In my schema there are certain fields used for feed plugin.
>>>
>>> <!-- fields for feed plugin (tag is also used by
>>> microformats-reltag)-->
>>> <field name="author" type="string" stored="true" indexed="true"/>
>>> <field name="tag" type="string" stored="true" indexed="true"
>>> multiValued="true"/>
>>> <field name="feed" type="string" stored="true" indexed="true"/>
>>> <field name="publishedDate" type="date" stored="true"
>>> indexed="true"/>
>>> <field name="updatedDate" type="date" stored="true"
>>> indexed="true"/>
>>>
>>> I have included the feed plugin in nutch site xml. The feed file is fetched
>>> and parsed, also the links in it are working properly. But I cannot get
>>> the publishedDate working.
>>> I cannot retrieve the publishedDate or sort by it.
>>>
>>> Please help.
>>
>>
>>
>> --
>> Lewis

--
Lewis

